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Abstract 

The hidden Markov model (HMM) is a widely-used generative model that copes with 
sequential data, assuming that each observation is conditioned on the state of a hidden 
Markov chain. In this paper, we derive a novel algorithm to cluster HMMs based on the 
hierarchical EM (HEM) algorithm. The proposed algorithm i) clusters a given collection 
of HMMs into groups of HMMs that are similar, in terms of the distributions they rep- 
resent, and ii) characterizes each group by a "cluster center", i.e., a novel HMM that is 
representative for the group, in a manner that is consistent with the underlying generative 
model of the HMM. To cope with intractable inference in the E-step, the HEM algorithm 
is formulated as a variational optimization problem, and efficiently solved for the HMM 
case by leveraging an appropriate variational approximation. The benefits of the proposed 
algorithm, which we call variational HEM (VHEM), are demonstrated on several tasks 
involving time-series data, such as hierarchical clustering of motion capture sequences, and 
automatic annotation and retrieval of music and of online hand-writing data, showing im- 
provements over current methods. In particular, our variational HEM algorithm effectively 
leverages large amounts of data when learning annotation models by using an efficient hi- 
erarchical estimation procedure, which reduces learning times and memory requirements, 
while improving model robustness through better regularization. 

Keywords: Hierarchical EM algorithm, clustering, hidden Markov model, hidden Markov 
mixture model, variational approximation, time-series classification 



1. Introduction 



The hidden Markov model (HMM) (Rabiner and Juang, 1993) is a probabilistic model 
that assumes a signal is generated by a double embedded stochastic process. A discrete- 
time hidden state process, which evolves as a Markov chain, encodes the dynamics of the 
signal, while an observation process encodes the appearance of the signal at each time, 
conditioned on the current state. HMMs have been successfully applied to a variety of 
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fields, including speech recognition (Rabiner and Juang, 1993), music analysis (Qi et al. 



2007[ ) and identification (Batlle et al. 2002), online hand-writing recognition (Nag et al. 
1986), analysis of biological sequences (Krogh et al. 1994), and clustering of time series 



data ( |Jebara et al.[ |2007[ |Smyth[ |1997 ). 

This paper is about clustering HMMs. More precisely, we are interested in an algorithm 
that, given a collection of HMMs, partitions them into K clusters of "similar" HMMs, 
while also learning a representative HMM "cluster center" that concisely and appropriately 
represents each cluster. This is similar to standard k-means clustering, except that the data 
points are HMMs now instead of vectors in ~K d . 

Various applications motivate the design of HMM clustering algorithms, ranging from 
hierarchical clustering of sequential data (e.g., speech or motion sequences modeled by 
HMMs (Jebara et al. 2007)), to hierarchical indexing for fast retrieval, to reducing the 



computational complexity of estimating mixtures of HMMs from large datasets (e.g., seman- 
tic annotation models for music and video) - by clustering HMMs, efficiently estimated 
from many small subsets of the data, into a more compact mixture model of all data. 
However, there has been little work on HMM clustering and, therefore, its applications. 

Existing approaches to clustering HMMs operate directly on the HMM parameter space, 
by grouping HMMs according to a suitable pairwise distance defined in terms of the HMM 
parameters. However, as HMM parameters lie on a non-linear manifold, a simple application 
of the k-means algorithm will not succeed in the task, since it assumes real vectors in a 
Euclidean space. In addition, such an approach would have the additional complication 
that HMM parameters for a particular generative model are not unique, i.e., a permutation 



of the states leads to the same generative model. One solution, proposed by Jebara et al. 



(2007), first constructs an appropriate similarity matrix between all HMMs that are to 
be clustered (e.g., based on the Bhattacharya affinity, which depends non-linearly on the 
HMM parameters (Jebara et al. 2004)), and then applies spectral clustering. While this 
approach has proven successful to group HMMs into similar clusters (Jebara et al. 2007), 
it does not directly address the issue of generating HMM cluster centers. Each cluster can 
still be represented by choosing one of the given HMMs, e.g., the HMM which the spectral 
clustering procedure maps the closest to each spectral clustering center. However, this 
may be suboptimal for some applications of HMM clustering, for example in hierarchical 
estimation of annotation models. Another distance between HMM distributions suitable 
for spectral clustering is the KL divergence, which in practice has been approximated by 
sampling sequences from one model and computing their log-likelihood under the other 
(Juang and Rabiner, 1985 Zhong and Ghosh 2003; Yin and Yang 2005). 

Instead, in this paper we propose to cluster HMMs directly with respect to the probability 
distributions they represent. The probability distributions of the HMMs are used through- 
out the whole clustering algorithm, and not only to construct an initial embedding as in 
(Jebara et al. , 2007). By clustering the output distributions of the HMMs, marginalized over 
the hidden-state distributions, we avoid the issue of multiple equivalent parameterizations 
of the hidden states. We derive a hierarchical expectation maximization (HEM) algorithm 
that, starting from a collection of input HMMs, estimates a smaller mixture model of HMMs 
that concisely represents and clusters the input HMMs (i.e., the input HMM distributions 
guide the estimation of the output mixture distribution). 
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The HEM algorithm is a generalization of the EM algorithm — the EM algorithm can 
be considered as a special case of HEM for a mixture of delta functions as input. The main 
difference between HEM and EM is in the E-step. While the EM algorithm computes the 
sufficient statistics given the observed data, the HEM algorithm calculates the expected suf- 
ficient statistics averaged over all possible observations generated by the input probability 
models. Historically, the first HEM algorithm was designed to cluster Gaussian probabil- 
ity distributions (Vasconcelos and Lippman, 1998). This algorithm starts from a Gaussian 
mixture model (GMM) with components and reduces it to another GMM with fewer 
components, where each of the mixture components of the reduced GMM represents, i.e., 
clusters, a group of the original Gaussian mixture components. More recently, |Chan et al. 



( |2010a| ) derived an HEM algorithm to cluster dynamic texture (DT) models (i.e., linear 
dynamical systems, LDSs) through their probability distributions. HEM has been applied 
successfully to construct GMM hierarchies for efficient image indexing (Vasconcelos, 2001), 
to cluster video represented by DTs (Chan et al. 2010b), and to estimate GMMs or DT 
mixtures (DTMs, i.e., LDS mixtures) from large datasets for semantic annotation of im- 



ages (Carneiro et al. 2007), video (Chan et al. 2010b) and music (Turnbull et al. 2008 



ages ( 


Carneiro et al. 


Coviello et al. , 


2011) 



To extend the HEM framework for GMMs to mixtures of HMMs (or hidden Markov mix- 
ture models, H3Ms), additional marginalization over the hidden-state processes is required, 
as with DTMs. However, while Gaussians and DTs allow tractable inference in the E-step of 
HEM, this is no longer the case for HMMs. Therefore, in this work, we derive a variational 
formulation of the HEM algorithm (VHEM), and then leverage a variational approximation 
derived by Hershey et al. (2008) (which has not been used in a learning context so far) 
to make the inference in the E-step tractable. The resulting algorithm not only clusters 
HMMs, but also learns novel HMMs that are representative centers of each cluster. The 
resulting VHEM algorithm can be generalized to handle other classes of graphical models, 
for which exact computation of the E-step in the standard HEM would be intractable, by 
leveraging similar variational approximations — e.g., the more general case of HMMs with 
emission probabilities that are (mixtures of) continuous exponential family distributions. 

Compared to the spectral clustering algorithm of (Jebara et al. 2007), the VHEM al- 
gorithm has several advantages that make it suitable for a variety of applications. First, 
the VHEM algorithm is capable of both clustering, as well as learning novel cluster centers, 
in a manner that is consistent with the underlying generative probabilistic framework. In 
addition, since it does not require sampling steps, it is also scalable with low memory re- 
quirements. As a consequence, VHEM for HMMs allows for efficient estimation of HMM 
mixtures from large datasets using a hierarchical estimation procedure. In particular, in- 
termediate HMM mixtures are first estimated in parallel by running the EM algorithm on 
small independent portions of the dataset. The final model is then estimated from the 
intermediate models using the VHEM algorithm. Because VHEM is based on maximum- 
likelihood principles, it drives model estimation towards similar optimal parameter values 
as performing maximum-likelihood estimation on the full dataset. In addition, by averaging 
over all possible observations compatible with the input models in the E-step, VHEM pro- 
vides an implicit form of regularization that prevents over-fitting and improves robustness 
of the learned models, compared to a direct application of the EM algorithm on the full 
dataset. Note that, in contrast to (Jebara et al. 2007), VHEM does not construct a kernel 
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embedding, which is a costly operation that requires calculating of 0((K^) 2 ) pairwise sim- 
ilarity scores between all input HMMs to form the similarity matrix, as well as the inversion 
of this matrix. 

In summary, the contributions of this paper are three-fold: i) we derive a variational for- 
mulation of the HEM algorithm for clustering HMMs, which generates novel HMM centers 
representative of each cluster; ii) we evaluate VHEM on a variety of clustering, annota- 
tion, and retrieval problems involving time-series data, showing improvement over current 
clustering methods; iii) we demonstrate in experiments that VHEM can effectively learn 
HMMs from large sets of data, more efficiently than standard EM, while improving model 
robustness through better regularization. With respect to our previous work, the VHEM 



algorithm for H3M was originally proposed in (Coviello et al. 2012a) 

The remainder of the paper is organized as follows. We review the hidden Markov 
model (HMM) and the hidden Markov mixture model (H3M) in Section [2j We present the 
derivation of the VHEM-H3M algorithm in Section [3j followed by a discussion in Section |4j 
Finally, we present experimental results in Sections [5] and [6j 

2. The hidden Markov (mixture) model 

A hidden Markov model (HMM) M. assumes a sequence of r observations y\- T = {y\, . . . , y T } 
is generated by a double embedded stochastic process, where each observation (or emission) 
yt at time t depends on the state of a discrete hidden variable xt , and the sequence of hidden 
states X\. T = {xi, . . . ,x T } evolves as a first-order Markov chain. The hidden variables can 
take one of S values, {1, . . . , 5}, and the evolution of the hidden process is encoded in a state 
transition matrix A = [ap i /3']/3 j pi=i ) ... ) s, where each entry, a$$i = p(xt+i = (3'\x t = 
is the probability of transitioning from state j3 to state (3' , and an initial state distribution 
7T = [m, . . . , its], where = p(x\ = fi\M). 

Each state j3 generates observations according to an emission probability density func- 
tion, p(yt\xt = ft, M). Here, we assume the emission density is time-invariant, and modeled 
as a Gaussian mixture model (GMM) with M components: 

M 

p(y\x = p,M)=^2 c^ m p(y\C = m,x = (3, M), (1) 

m=l 

where £ ~ multinomial(c« i, . . . , cr m) 1S t ne hidden assignment variable that selects the 
mixture component, with cg )?n as the mixture weight of the mth component, and each 
component is a multivariate Gaussian distribution, 

p(y\C = m, x = fi, M) = M{y; fMfi >m , £a, OT ), (2) 
with mean ixp >m and covariance matrix Yip )Tn . The HMM is specified by the parameters 

M = {it, A, {{cp >m , (J,p )Tn , Ep,m}m=l}p=l}> ( 3 ) 
which can be efficiently learned from an observation sequence y\- T with the Baum- Welch 



algorithm (Rabiner and Juang, 1993), which is based on maximum likelihood estimation. 
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The probability distribution of a state sequence x\- T generated by an HMM Ai is 

r t 

p(xi: T \M) =p(xi\M)Y[p(x t \x t -i,M) = ir xi Y[a Xt _ ltXt , (4) 

t=2 t=2 

while the joint likelihood of an observation sequence y\- T and a state sequence x\- T is 

r t 

p{yi:r,X 1:T \M) = p(yi:r\xi: T ,M)p(x 1 : T \M) = p{x X \M) \\ p{x t \x t -l , M) \\ p{yt \x t , M ) . 



t=2 



t=l 



(5) 



Finally, the observation likelihood of y\- T is obtained by marginalizing out the state sequence 
from the joint likelihood, 



P(yi:r\M) = ^2p(yirr,Xl. T \M) = ^ p(yi:r \xi :T , M)p(x 1:T \M ) , 



(6) 



where the summation is over all state sequences of length r, and can be performed efficiently 



using the forward algorithm (Rabiner and Juang, 1993). 



A hidden Markov mixture model (H3M) (Smyth, 1997) models a set of observation se- 
quences as samples from a group of K hidden Markov models, each associated to a specific 
sub-behavior. For a given sequence, an assignment variable z ~ multinomial (wi, • • • ,ujk) 
selects the parameters of one of the K HMMs, where the kth. HMM is selected with prob- 
ability wjfc. Each mixture component is parametrized by 

■M. z = {tt z , A z , {{c z p m , fJ.p }Tn , T,^ m }^ =l }^ =1 } , (7) 

and the H3M is parametrized by M. = {w z , A4 Z }^ =1 , which can be estimated from a collec- 



tion of observation sequences using the EM algorithm (Smyth, 1997). 



To reduce clutter, here we assume that all the HMMs have the same number S of hidden 
states and that all emission probabilities have M mixture components. Our derivation could 
be easily extended to the more general case though. 

3. Clustering hidden Markov models 

Algorithms for clustering HMMs can serve a wide range of applications, from hierarchical 



clustering of sequential data (e.g., speech or motion sequences modeled by HMMs ( Jebara 



et al. 2007)), to hierarchical indexing for fast retrieval, to reducing the computational 
complexity of estimating mixtures of HMMs from large weakly-annotated datasets - - by 
clustering HMMs, efficiently estimated from many small subsets of the data, into a more 
compact mixture model of all data. 

In this work we derive a hierarchical EM algorithm for clustering HMMs (HEM-H3M) 
with respect to their probability distributions. We approach the problem of clustering 
HMMs as reducing an input HMM mixture with a large number of components to a new 
mixture with fewer components. Note that different HMMs in the input mixture are allowed 
to have different weights (i.e., the mixture weights {co z }^ =1 are not necessarily all equal). 
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One method for estimating the reduced mixture model is to generate samples from 
the input mixture, and then perform maximum likelihood estimation, i.e., maximize the 
log-likelihood of these samples. However, to avoid explicitly generating these samples, we 
instead maximize the expectation of the log-likelihood with respect to the input mixture 
model, thus averaging over all possible samples from the input mixture model. In this way, 
the dependency on the samples is replaced by a marginalization with respect to the input 
mixture model. While such marginalization is tractable for Gaussians and DTs, this is no 
longer the case for HMMs. Therefore, in this work, we i) derive a variational formulation 
of the HEM algorithm (VHEM), and ii) specialize it to the HMM case by leveraging a 



variational approximation proposed by Hershey et al. (2008). Note that (Hershey et al 



2008) was proposed as an alternative to MCMC sampling for the computation of the KL 



divergence between two HMMs, and has not been used in a learning context so far. 



We present the problem formulation in Section 3.1 and derive the algorithm in Sections 



3.2, 3.3 and 3.4 



3.1 Formulation 

Let be a base hidden Markov mixture model with K^ 3 ' components. The goal of the 

VHEM algorithm is to find a reduced hidden Markov mixture model with < 
(i.e., fewer) components that represents well. The likelihood of a random sequence 

Vv.r ~ M.( b ' is given by 

K {b) 

P (yi:r\M^) = ^ W (% :r |zW=i,MW), (8) 
i=l 

where multinomial (w£ , ••• ) is the hidden variable that indexes the mixture 

components. p{y\- T \z = i,A4^) is the likelihood of y\- T under the ith. mixture component, 
as in (6), and cj- is the mixture wei 
the random sequence y\- T ~ IS 



as in (6), and cj- is the mixture weight for the ith component. Likewise, the likelihood of 



P (yi:r\M^) = Y,^ r) P(yi:r\z (r) =j,M^), (9) 

i=i 

where multinomial ( uJ[\ ■■ ■ ,to^\ r) ) is the hidden variable for indexing components 

in M^. 

At a high level, the VHEM-H3M algorithm estimates the reduced H3M model MS r > in 
Q from virtual sequences distributed according to the base H3M model M. W in ([s]) . From 
this estimation procedure, the VHEM algorithm provides: 

1. a soft clustering of the orig inal components into 

K [r) 

groups, where cluster 

membership is encoded in assignment variables that represent the responsibility of 
each reduced mixture component for each base mixture component, i.e., Zij = p(z^ r ' = 
j\zV> = i), for i = l,...,K^ and j = 1, . . . , ifW; 

2. novel cluster centers represented by the individual mixture components of the reduced 
model in (§), i.e., p{yv. T \z (r) =j,M^) for j = l,...,K^. 
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Table 1: Notation used in the derivation of the VHEM-H3M algorithm. 



variables 


base model (b) 


reduced model (r) 


1 C TT1\ Tl\ T j 

index tor HMM components 


i 


3 


n n in ri p r n f rT A 1 A 1 c r\ m ti c\t\ pti t" Q 

11U111UC1 UI J.X1VJ.1VJ. (_*WllCllLo 


K( b ) 


K( r ) 


HMM states 




P 


number of HMM states 


S 


s 


HMM state sequence 


/?l:r = {01>" - 


PV.T = {Pi,'" ,Pt} 


index for component of GMM 


m 


i 


number of Gaussian components 


M 


M 








XlOlvl 




ka( t ) 


HMM component (of H3M) 


M (h) 


M {r) 

3 


GMM emission 


M ib l 


M {r) 
3 A 


Hq in cci q ri rnmnnnpnt ( nf OA/TA/Ti 

V_f cl 11 o o 1 £l 1 1 (_-UlilJJUllClllj 1 Ul VjIlV-LlVJ. 1 


M ib) 


■ h,p/ 


parameters 


" {b) = {^ b) } 

_(&),« _ r_.W.n 
n — I 71 /? J 


w« = {wj r) } 


H3M mixture weights 


HA/IA/T initial Qtatp 


_(r)j _ r Wj'i 

n — \ n p J 


HMM state transition matrix 


A ~ l a 0,8'\ 


4(r),j _ \ J r )j] 
{c ( ^At (r K£ (r H}f 1 

>'/9,< ' p,i St=l 


GMM emission 


J>(*)>* .,( b )>« y( b ),i\M 
1 0,m> r"0,m' @,m)m=l 


probability distributions 


notation 


short-hand 


HMM state sequence (b) 


p(xi :T = /3i^|zW=i,A<W) 




HMM state sequence (r) 


P(xi:r = Pl:r\z (r) =j,M^) 


P (p 1 :r\M ( p)=^ 


HMM observation likelihood (r) 


p{yi-.M r) =i,M {T) ) 


p(yi:r\M^ ] ) 


GMM emission likelihood (r) 


p{yt\x t = p,M ( p) 


p(.yt\M ( ;> p ) 


Gaussian component likelihood (r) 


p(y t \(t = e,x t = p,M { ; ) ) 


p(yt\M^ e ) 


expectations 




HMM observation sequence (b) 


E y 1:T \z( b )=i,M( b l['] 


E M m [•] 


GMM emission (b) 


%\x t =p,M?^ 




Gaussian component (b) 


E y t |Ct=m,x t =^,7W, (i,) ^ 





expected log-likelihood lower bound variational distribution 

E M m [\ogp(Yi\M^)] C l mM qi (zi = j) = Zij 

Ej«pOgp(lfl :T |A<5 r) )] £^ MM g' J '(pi^|/3l:r) = 

= W{pi\Pi)Y[UW{pt\Pt-i,Pt) 
* M( ^°zp{y\M^)\ 4ff#>> q g {c = e]m) = r ,m,M 



Finally, because we take the expectation over the virtual samples, the estimation is carried 
out in an efficient manner that requires only knowledge of the parameters of the base model, 
without the need of generating actual virtual samples. 

Notation. We will always use i and j to index the components of the base model and 
the reduced model Ai^ r \ respectively. To reduce clutter, we will also use the short-hand 
notation Aif^ and to denote the ith component of and the jth component of 
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MS r \ respectively. Hidden states of the HMMs are denoted with /3 for the base model 

(b) (r) 

M\ , and with p for the reduced model M.^ . 

(b) (r) 

The GMM emission models for each hidden state are denoted as M) I and M\ ■'. We will 
always use m and I for indexing the individual Gaussian components of the GMM emissions 
of the base and reduced models, respectively. The individual Gaussian components are 

(b) (r) 

denoted as M.\ L m for the base model, and -Mj p £ for the reduced model. Finally, we 
denote the parameters of zth HMM component of the base mixture model as = 
{7r( 6 )'%A( 6 )'\{{c^\M^\E^ ) r ; i }^ =1 }| =1 }, and for the jth HMM in the reduced mixture 

as Mf = {Tr^iAW^^.^.E^}^}^}. 

When appearing in a probability distribution, the short-hand model notation (e.g., 
M.}) always implies conditioning on the model being active. For example, we will use 
PiVV-A-M-i ) as short-hand for p(yi-. T \ = i,A4^), or p(yt\-M fl) as short-hand for p(yt\xt = 
(3,z^ = i,A4^). Furthermore, we will use tt^' 4 as short-hand for the probability of the 
state sequence j3x :T according to the base HMM component Ai- , i.e., p(/3i :T |.M,- ), and 
likewise Mp r }.'^ for the reduced HMM component. 

Expectations will also use the short-hand model notation to imply conditioning on the 
model. In addition, expectations are assumed to be taken with respect to the output variable 
(yi :r or yt), unless otherwise specified. For example, we will use E »>(&)[•] as short-hand for 

i 

E y 1:T | 2 (i>)=i,X(b) [•]■ 

Table [T] summarizes the notation used in the derivation, including the variable names, 
model parameters, and short-hand notations for probability distributions and expectations. 
The bottom of Table □ also summarizes the variational lower bound and variational distri- 
butions, which will be introduced subsequently. 

3.2 Variational HEM algorithm 

To learn the reduced model in ([9]), we consider a set of N virtual samples, distributed 
according to the base model J\A^ in ([s]), such that iVj = Ncof^ samples are drawn from 
the ith component. We denote the set of Ni virtual samples for the ith component as 
Yi = {y^!^ 71 }^t=i> where y±.'™^ ~ Mf \ and the entire set of N samples as Y = {1^}^'. 
Note that, in this formulation, we are not considering virtual samples {x^!™\ y^.'^} for each 
base component, according to its joint distribution p(x\- T , yi :T \-M.\ )■ The reason is that the 

hidden-state space of each base mixture component Aif^ may have a different representation 

(e.g., the numbering of the hidden states may be permuted between the components). This 

(r) 

mismatch will cause problems when the parameters of Aij are computed from virtual 

samples of the hidden states of {M^yf^i ■ Instead, we treat X% = {xi'^}^ =1 as "missing" 
information, and estimate them in the E-step. The log-likelihood of the virtual samples is 

K (b) 

logp(Y\M^) = logp(^|-M (r) ), (10) 

8=1 
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where, in order to obtain a consistent clustering, we assume the entirety of samples Yi is 



assigned to the same component of the reduced model (Vasconcelos and Lippman, 1998) 



The original formulation of HEM (Vasconcelos and Lippman, 1998) maximizes (10) 



with respect to Ai^ r \ and uses the law of large numbers to turn the virtual samples Yi 
into an expectation over the base model components J\A.\ . In this paper, we will start 
with a different objective function to derive the VHEM algorithm. To estimate A4^ r \ we 
will maximize the average log- likelihood of all possible virtual samples, weighted by their 
likelihood of being generated by -M^ , i.e., the expected log- likelihood of the virtual samples, 



J(M {r) ) =E 



M0>) 



logp{Y\M {r) [ 



K Q>) 



M 



(b) 



\ogp{Xi\M^) 



(11) 



where the expectation is over the base model components Aif^. Maximizing ( |ll[ ) will 
eventually lead to the same estimate as maximizing (10), but allows us to strictly preserve 



the variational lower bound, which would otherwise be ruined when applying the law of 



large numbers to (10). 



A general approach to deal with maximum likelihood estimation in the presence of 



hidden variables (which is the case for H3Ms) is the EM algorithm (Dempster et al. 1977). 



In the traditional formulation the EM algorithm is presented as an alternation between 
an expectation step (E-step) and a maximization step (M-step). In this work, we take a 



variational perspective (Neal and Hinton 1998 Wainwright and Jordan, 2008; Csisz et al. 



1984), which views each step as a maximization step. The variational E-step first obtains a 
family of lower bounds to the (expected) log- likelihood (i.e., to (fTi) 



indexed by variational 

parameters, and then optimizes over the variational parameters to find the tightest bound. 
The corresponding M-step then maximizes the lower bound (with the variational parameters 
fixed) with respect to the model parameters. One advantage of the variational formulation is 
that it readily allows for useful extensions to the EM algorithm, such as replacing a difficult 
inference in the E-step with a variational approximation. In practice, this is achieved by 
restricting the maximization in the variational E-step to a smaller domain for which the 
lower bound is tractable. 



3.2.1 Lower bound to an expected log-likelihood 

Before proceeding with the derivation of VHEM for H3Ms, we first need to derive a lower- 
bound to an expected log-likelihood term, e.g., (11). In all generality, let {0,H} be the 



observation and hidden variables of a probabilistic model, respectively, where p(H) is the 
distribution of the hidden variables, p(0\H) is the conditional likelihood of the observations, 
and p(O) = Y^h P(0\H)p(H) is the observation likelihood. We can define a variational 
lower bound to the observation log-likelihood ( | Jordan et al. , |1999 Jaakkola , 2000 ) : 



logp(O) > logp(O) 

= ^g(F)log 



D(q(H)\\p(H\0)) 
p{H)p{0\H) 



q(H) 



(12) 
(13) 



where p(H\0) is the posterior distribution of H given observation O, and D(p\\q) = 
J p(y) log ^jdy is the Kullback-Leibler (KL) divergence between two distributions, p and 
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q. We introduce a variational distribution q(H), which approximates the posterior dis- 
tribution, where Y2h Q(H) = 1 ano - — 0- When the variational distribution equals 
the true posterior, q(H) = P(H\0), then the KL divergence is zero, and hence the lower- 
bound reaches logp(O). When the true posterior cannot be computed, then typically q is 
restricted to some set of approximate posterior distributions Q that are tractable, and the 
best lower-bound is obtained by maximizing over q £ Q, 



logp(O) > max q(H) 



log 



p(H)p(Q\H) 



(14) 



Using the lower bound in (14), we can now derive a lower bound to an expected log- 



likelihood expression. Let Ef,[-] be the expectation with respect to O with some distribution 



Pb{0). Since Pb{0) is non-negative, taking the expectation on both sides of (14) yields, 



E b pogp(O)] >E 6 



rmi p(H)p(Q\H) 



> maxEfc 



max V q{H) \ log ^ + E & [\ogp{0\H)\ 

qeQ H ^ 9 ' > 



(15) 

(16) 
(17) 



where (16) follows from Jensen's inequality (i.e., /(E[x]) < E[/(x)] when / is convex), and 



the convexity of the max function. Hence, (17) is a variational lower bound on the expected 



log-likelihood, which depends on the family of variational distributions Q. 



3.2.2 Variational lower bound 



We now derive a lower bound to the expected log- likelihood cost function in (11). The 



derivation will proceed by successively applying the lower bound from (17) to each expected 



log-likelihood term that arises. This will result in a set of nested lower bounds. We first 
define the following three lower bounds: 



E M ( 6) [logp(yi|A<W)]>4 



E 



M w [log p(y 1:T \M) ')} > C l HMM , 



H3M) 

hi 



E Mm [logp(y\M^)]>£ 



GMM 



(18) 
(19) 
(20) 



The first lower bound, £%j 3 m, is on the expected log-likelihood of an H3M Jv[^ with respect 



to an HMM M.f ] . The second lower bound, C l g MM , is on the expected log-likelihood of an 

Although the 

can be computed exactly using the forward algorithm 



HMM J^f \ averaged over observation sequences from a different HMM Ai[ ■ 
data log-likelihood logp(yi :T \A4 ( f^) 



(Rabiner and Juang, 1993), calculating its expectation is not analytically tractable since an 



j(r) - , 

observation sequence y\- T from a HMM M - is essentially an observation from a mixture 
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model] The third lower bound, £qmm ■> ls on ^ ne ex P ec ted log-likelihood of a GMM 

• (r) (b) 

emission density with respect to another GMM Ai) L This lower bound does not 

depend on time, as we have assumed that the emission densities are time- invariant. 



Looking at an individual term in (11), p(Yi\A4^) is the likelihood under a mixture of 



HMMs, as in Q, where the observation variable is Y{ and the hidden variable is Zi (the 
assignment of to a component of A4^). Hence, introducing the variational distribution 



qi(zi) and applying (17), we have 



E w [logp(l-|A^W 



> maxV^ = i) \log p{Zl J l^ (r)) +E wpogp^lA^f)] 

1i . Qi\ z i — 3 ) 1 



max 



max 

5i 



£ft(* =3) (log + E ,^> [^p(yi:r\Mt^ 



X^O* = j) < log 



Qi(Zi = J) 

p(z t =j\M^) 
Qi{zi = j) 



+ N i V Mi , ) [\ogp(y 1 .. T \Mf)] 



(^)^ 

3 



(21) 
(22) 
(23) 



where in (22) we use the fact that Yi is a set of N{ i.i.d. samples. In (23), \ogp{y\- T \JV[y) 
is the observation log-likelihood of an HMM, which is essentially a mixture distribution, 
and hence its expectation cannot be calculated directly. Instead, we use the lower bound 



£-hmm defined in (119b, yielding 



\ogp{Y l \M i - r) ) 



. V- / -N J 1 g(g = il-^ (r) ) , M r i,j \ A ri 

> max 2^ = J) < log — „,( ^, _ „■>> — + N i £ HMM ( = £hsm- 



i(Zi =j) 



(24) 



Next, we calculate the lower bound ^tjmm- Starting with (19), we first rewrite the 



expectation ~E> M m to explicitly marginalize over the HMM state sequence j3\ :T from J^i\ b \ 



^ M ^ogp(yi-.r\M ( [ ) )}=E 0iTlMf) 



VI:t\P\:t,M 



(25) 
(26) 



For the HMM likelihood p{yv. T \Mj), given by (6), the observation variable is y\- T and the 
hidden variable is the state sequence p\- T . We therefore introduce a variational distribution 
<7 ij (P1:t|/3i:t) on the state sequence p\ :T , which depends on a particular sequence j3\- T from 



1. For an observation sequence of length r, an HMM with S states can be considered as a mixture model 
with 0(S T ) components. 
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M.f>. Applying (17) to (26), we have 



V M{b) [\ogp{ yi .. T \M { p)] 



> 



\- (b),i if/ IP v /, Kttw- -M} 

2^ 4l ' maX E « J (Pl:r /3l:r) < log ^ rf- 

Pl.t t ,j Z g^(pi:r 

P1:t " 



P1:t 



p(pi:r|MS r) 



Put 



Pi: 



^ E^^E^WlA-) <! log 



P(P1:t\M { ; ] 



Put 



Pi: 



V.T 



(27) ' 



+ E^ 



l i,0t 



GMM ( — z - 



HMM' 



(29) 



where in (28) we use the conditional independence of the observation sequence given the 
state sequence, and in (29) we use the lower bound, defined in (20), on each expectation. 

Finally, we derive the lower bound £q'mm f° r (20). First, we rewrite the expectation 
with respect to M^l to explicitly marginalize out the GMM hidden assignment variable £, 



E M ^[\ogp(y\M^ p )] =- E 



M 

E 



E M w ( \}og P (y\MV)] 



m m 



m=l 



logp(y\M)[ 



hp' 



(30) 
(31) 



Note that p(y\A4^' p ) is a GMM emission distribution as in (1). Hence, the observation vari- 
able is y, and the hidden variable is £. Therefore, we introduce the variational distribution 
(Cl m )j which is conditioned on the observation y arising from the mth component in 



,(&) 



M-"p, and apply (|17|) 



E M ^[\ogp(y\M^ p )] 



M M (- r)(C\M (r) ) 

* E ft^E&KM log + E M w Pogp(y|A<« f; 

m=l 9 ,s',p C=l I 9/8 >/ 



(C|m) 



i,f3 ,m 



A /»(*>/8)»(3jP) 
— '-GMM ' 



(32) 



where E [logp(y|-M^j «)] is the expected log-likelihood of the Gaussian distribution 



•A4^j^ with respect to the Gaussian M\ which has a closed- form solution (see Section 



3.3.1). 



In summary, we have derived a variational lower bound to the expected log-likelihood 
of the virtual samples in (11), 



J{M^) = E M(h) [\ogp(Y\M^)] >^4m> 

8=1 



(33) 
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which is composed of three nested lower bounds, corresponding to different model elements 
(the H3M, the component HMMs, and the emission GMMs), 



:W = mflx5>(* = j) (iog p( V J ' |A f )) + W U 



HMM 



c 



GMM 



Qi(zi = j) 



i'~HA4M ( ' 



-GMM 



ft :T 9 ' PUT 

M M r 

«/& C=l I 



m=l 



771 



i,j3,m 



(34) 
(35) 



j (Pl:r|/3l:r 

+ E {6 ) Jogp(y|A^52 )C )]^. 



where qi(zi), q 1 ' 3 {pi-.t\Pi:t), and „(C|^) are the corresponding variational distributions. 
Finally, the variational HEM algorithm for HMMs consists of two alternating steps: 

• (variational E-step) given M. ^ , calculate the variational distributions q % $ p {C\m), Q i ''' {pi-.t\Pi--t)i 



and qi(zi) for the lower bounds in (36), (35), and (34) 



• (M-step) update the model parameters via M.^* = argmax^r) Yld=i £ 
In the following subsections, we derive the E- and M-steps of the algorithm. 



H3M- 



3.3 Variational E-step 

The variational E-step consists of finding the variational distributions that maximize the 



lower bounds in (36), (35), and (34). In particular, given the nesting of the lower bounds, 



we proceed by first maximizing the GMM lower bound C$$$' p) for each pair of emission 
GMMs in the base and reduced models. Next, the HMM lower bound C^ MM is maximized 
for each pair of HMMs in the base and reduced models, followed by maximizing the H3M 
lower bound C l H3M for each base HMM. Finally, a set of summary statistics are calculated, 
which will be used in the M-step. 



3.3.1 Variational distributions 

We first consider the forms of the three variational distributions, as well as the optimal 
parameters to maximize the corresponding lower bounds. 



GMM: For the GMM lower bound £ 



has the form (Hershey et al., 2008) 



GMM ' 



we assume each variational distribution 



-,'•■) 



(C = l\m) = r/Jj 



m 



where Y,t=i V e \ 



M (i,/3),(j,p) 



1, and rfff'M > 0, 



Intuitively, rj 



(37) 



is the responsi- 



bility matrix between each pair of Gaussian components in the GMMs M °i and M fl, 

jiP 

where rfl'^ represents the probability that an observation from component m of A4^ bj 



corresponds to component £ of Ai - 



(0 



13 



Coviello, Chan, Lanckriet 



Substituting into ( 36 ) , we have 

M M 

E(b),i sr^ (i 



{i,P),(j,p) 



GMM 



m=l 



<l}:\ 



m 



log 



Jr)J 



1=1 



'U\m 



+ E 



M 



(b) 



[logp(y\M 



(r) 



The maximizing variational parameters are obtained as (see Appendix C.2) 



'k\m 



(r)j 

c P ,e ex P< 


[E^JbgpfelxS,)]] 


\ 


cjJ J exp < 


E^JlogpfelA^,,)]} 



(38) 



(39) 



where the expected log- likelihood of a Gaussian MS ' t with respect to another Gaussian 



- b ] is computable in closed-form (iPenny and Roberts 



2000), 



E w pogp(»|A<^)] 



(r) m ^ 



1 



log 2tt - - log 



(6).»\T/- V (r)j\-1/ Mj 



(b),i 



(40) 



HMM: For the HMM lower bound £# MM , 



we assume each variational distribution 



takes the form of a Markov chain, 

T 

^'(Pl-l/^r) = <fM{pivr\M = $ (f>l\Pl) U W(Pt\pt-l, fit), 



(41) 



t=2 



where Y^ pi =i = 1) an d Em=i 0t J (Pt|Pi-ij A) = 1 3 an d au the factors are non 



s Y(p 1 |/3 1 ) = 1, andEjU ^I 
negative. The variational distribution q 1 ' 3 (px: T \Pi:t) represents the probability of the state 

(r) (r) 

sequence p\- T in HMM AA - , when Mr- is used to explain the observation sequence gener- 

,(b) 



ated by M\ that evolved through state sequence fi\- T . 
Substituting into (35), we have 



C HMM =J2^t^!fJ2 ( t )i ' j (f >1 ^P 1 ^ \ lQ g 



7T. 



Pl.T 



Pl.T 



GMM 



(42) 



The maximization with respect to cj) t (pt\pt-i, fit) and 4>{ \Pl\fil) ls carried out indepen- 
dently for each pair and follows (Hershey et al. 2008). This is further detailed in 
Appendix A. By separating terms and breaking up the summation over /3 1:T and px- T , the 
optimal (jyj.' 3 (pt\pt-i, fit) and <j){ 3 (p\\fii) can be obtained using an efficient recursive iteration 
(similar to the forward algorithm). 

H3M: For the H3M lower bound C l H3M , we assume variational distributions of the form 



Qi{zi = j) = Zij, where Ylf=i z ij = 1> an d z ij > 0. Substituting into (34), we have 



£-H3M = max Y] 



log 



UJ 



(r) 

' ^I'-'HMM 



(43) 
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The maximizing variational parameters of (43) are obtained in Appendix C.2, 



exp(NiC 



HMMJ 



(44) 



Note that in the standard HEM algorithm derived in ( Vasconcelos and Lippman, 1998 Chan 



et al. 2010b), the assignment probabilities z%j are based on the expected log-likelihoods of 



the components, (e.g., E u Q>)[logp(yi :T \M^ )] for H3Ms). For the variational HEM algo- 
rithm, these expectations are now replaced with their lower bounds (in our case, C^hmm)- 

3.3.2 Lower bound 



Substituting the optimal variational distributions into (38), (42), and (43) gives the lower 
bounds, 



J H3M 



r i -3 

HMM 



C 



GMM 



| log 



UJ 



(r) 



7T 



(r),j 

Pl-.T 



31:t 
M 



Pi: 



/ „ C /3,m / , 



M 



~(i,P),(j,p) 



(i,Pt),(j,Pt) 
GMM 



m=l 



'U\m 



(45) 
(46) 

(47) 



The lower bound C l ^ MM requires summing over all sequences fi\- T and p\ :T . This summation 
can be computed efficiently along with fy 3 {pt\pt-li fit) and <t>{ 3 (pi\fi%) using a recursive 



algorithm from (Hershey et al. 2008). This is described in Appendix A. 



3.3.3 Summary Statistics 

After calculating the optimal variational distributions, we calculate the following summary 
statistics, which are necessary for the M-step: 



^ 3 (p 1 ,fi 1 )=^' i ^ j (p 1 \fi 1 ), 



(48) 



tt 3 (Pt-i, Pt, fit) = u t-i(pt-i,fit-i)a { ^ iA 
\Pt-i=i 
s 

v% t \Pu Pt) = E ^(Pt-U Pufit), for t = 2, . . 
Pt-i=l 



(jH\fH-i,Pt), fort = 2,...,r, (49) 



(50) 
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and the aggregate statistics 

s 

i>¥( P ) = J2^ j (p>p)> ( 51 ) 
0=1 

r 

T S 

i=2 0=1 

The statistic v 1 ^ {p) is the expected number of times that the HMM Mj starts from state 

p, when modeling sequences generated by Ai\ . The quantity i) l ^(p,f3) is the expected 

number of times that the HMM M - is in state p when the HMM M\ is in state f3, when 

both HMMs are modeling sequences generated by M.\ . Similarly, the quantity C'^(p,p') 

(r) 

is the expected number of transitions from state p to state p of the HMM Mj , when 
modeling sequences generated by M.^ ■ 



3.4 M-step 



In the M-step, the lower bound in (33) is maximized with respect to the parameters MS r \ 

K (b) 

A<M* = argmax^£W (54) 
MM i=1 

The derivation of the maximization is presented in Appendix B. Each mixture component 
of Ai^ is updated independently according to 



H=l 

U j K (b) 

2-,i=l z i,j^i v \ VP) (r),j* _ 2^i=l C, J \P-,P) 



»=1 fEE<| 

- — i7ih\ — ' \ bb ) 



{r),j* _ £^t=l ~hj~i -1 \r> \rj,j _ ^i=v ~hj~i •> vr; r i /-,,-, 

(M,P),(j, P )\ (H,PW,P) ,1b) f 

(r),j* _ U 3,P yle\m ) (r ),j* _ \Jk\m Vf},m, 

c p.l ~ u / <■„• m f„- „i\ ' ^o.^ ~~ / c; m /-,; „i\ ' V D 'J 



Z^'=l si J,P \jfy| m J si J.P J 



°* (^ , ' e " ) 




) 







(58) 



where Qj,p(-) is the weighted sum operator over all base models, HMM states, and GMM 
components (i.e., over all tuples (i,/3,m)), 

s m 

n ijP (/(i, p, m)) = kA b) E pw E ^ m )- ( 59 ) 

t=l 0=1 m=l 
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Note that the covariance matrices of the reduced models in (58) include an additional 
outer-product term, which acts to regularize the covariances of the base models. This 
regularization effect derives from the E-step, which averages all possible observations from 
the base model. 



4. Applications and related work 

In the previous section, we derived the VHEM-H3M algorithm to cluster HMMs. We now 
discuss various applications of the algorithm (Section 4.1 ), and then present some literature 
that is related to HMM clustering (Section |4~2|). 



4.1 Applications of the VHEM-H3M algorithm 

The proposed VHEM-H3M algorithm clusters HMMs directly through the distributions they 
represent, and learns novel HMM cluster centers that compactly represent the structure of 
each cluster. 

An application of the VHEM-H3M algorithm is in hierarchical clustering of HMMs. In 
particular, the VHEM-H3M algorithm is used recursively on the HMM cluster centers, to 
produce a bottom-up hierarchy of the input HMMs. Since the cluster centers condense the 
structure of the clusters they represent, the VHEM-H3M algorithm can implicitly leverage 
rich information on the underlying structure of the clusters, which is expected to impact 
positively the quality of the resulting hierarchical clustering. 

Another application of VHEM is for efficient estimation of H3Ms from data, by using a 
hierarchical estimation procedure to break the learning problem into smaller pieces. First, a 
data set is split into small (non-overlapping) portions and intermediate HMMs are learned 
for each portion, via standard EM. Then, the final model is estimated from the intermediate 
models using the VHEM-H3M algorithm. Because VHEM and standard EM are based on 
similar maximum-likelihood principles, it drives model estimation towards similar optimal 
parameter values as performing EM estimation directly on the full dataset. However, com- 
pared to direct EM estimation, VHEM-H3M is more memory- and time-efficient. First, 
it no longer requires storing in memory the entire data set during parameter estimation. 
Second, it does not need to evaluate the likelihood of all the samples at each iteration, and 
converges to effective estimates in shorter times. Note that even if a parallel implementation 
of EM could effectively handle the high memory requirements, a parallel- VHEM will still 
require fewer resources than a parallel-EM. 

In addition, for the hierarchical procedure, the estimation of the intermediate models 
can be easily parallelized, since they are learned independently of each other. Finally, hi- 
erarchical estimation allows for efficient model updating when adding new data. Assuming 
that the previous intermediate models have been saved, re-estimating the H3M requires 
learning the intermediate models of only the new data, followed by running VHEM again. 
Since estimation of the intermediate models is typically as computationally intensive as the 
VHEM stage, reusing the previous intermediate models will lead to considerable computa- 
tional savings when re-estimating the H3M. 

In hierarchical estimation (EM on each time-series, VHEM on intermediate models), 
VHEM implicitly averages over all possible observations (virtual variations of each time- 
series) compatible with the intermediate models. We expect this to regularize estimation, 
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which may result in models that generalize better (compared to estimating models with 
direct EM). Lastly, the "virtual" samples (i.e., sequences), which VHEM implicitly generates 
for maximum-likelihood estimation, need not be of the same length as the actual input data 
for estimating the intermediate models. Making the virtual sequences relatively short will 
positively impact the run time of each VHEM iteration. This may be achieved without loss 



of modeling accuracy, as show in Section 6.3 
4.2 Related work 



Jebara et al. (2007)'s approach to clustering HMMs consists of applying spectral clustering 
to a probability product kernel (PPK) matrix between HMMs — we will refer to it as 
PPK-SC. In particular, the PPK similarity between two HMMs, and MS b \ is defined 



as 



k(a,b) = / P (yi:r\M^) X p(yi:r\M^) X dy 1: r, (60) 



where A is a scalar, and r is the length of "virtual" sequences. The case A = \ corresponds 
to the Bhattacharyya affinity. While this approach indirectly leverages the probability dis- 
tributions represented by the HMMs (i.e., the PPK affinity is computed from the probability 
distributions of the HMMs) and has proven successful in grouping HMMs into similar clus- 
ters (Jebara et al. 2007| >, it has several limitations. First, the spectral clustering algorithm 



cannot produce novel HMM cluster centers to represent the clusters, which is suboptimal 
for several applications of HMM clustering. For example, when implementing hierarchi- 
cal clustering in the spectral embedding space (e.g., using hierarchical k- means clustering), 
clusters are represented by single points in the embedding space. This may fail to capture 
information on the local structure of the clusters that, when using VHEM-H3M, would be 
encoded by the novel HMM cluster centers. Hence, we expect VHEM-H3M to produce 
better hierarchical clustering than the spectral clustering algorithm, especially at higher 
levels of the hierarchy. This is because, when building a new level, VHEM can leverage 
more information from the lower levels, as encoded in the HMM cluster centers. 

One simple extension of PPK-SC to obtain a HMM cluster center is to select the input 
HMM that the spectral clustering algorithm maps closest to the spectral clustering center. 
However, with this method, the HMM cluster centers are limited to be one of the existing 



input HMMs (i.e., similar to k-medoids (Kaufman and Rousseeuw, 1987)), instead of the 



HMMs that optimally condense the structure of the clusters. Therefore, we expect the 
novel HMM cluster centers learned by VHEM-H3M to better represent the clusters. A 
more involved, "hybrid" solution is to learn the HMM cluster centers with VHEM-H3M 
after obtaining clusters with PPK-SC — using the VHEM-H3M algorithm to summarize 
all the HMMs within each PPK-SC cluster into a single HMM. However, we expect our 
VHEM-H3M algorithm to learn more accurate clustering models, since it jointly learns the 
clustering and the HMM centers, by optimizing a single objective function (i.e., the lower 



bound to the expected log-likelihood in (33)). 

A second drawback of the spectral clustering algorithm is that the construction and the 
inversion of the similarity matrix between the input HMMs is a costly operation when their 
number is large (e.g., see the experiment on H3M density estimation on the music data in 



Section 6.1). Therefore, we expect VHEM-H3M to be computationally more efficient than 
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the spectral clustering algorithm since, by directly operating on the probability distributions 
of the HMMs, it does not require the construction of an initial embedding or any costly 
matrix operation on large kernel matrices. 



Finally, as Jebara et al. (2004) note, the exact computation of (60) cannot be carried 



out efficiently, unless A = 1. For different values of A Jebara et al. (2004) propose to 



approximate k(a,b) with an alternative kernel function that can be efficiently computed; 
this alternative kernel function, however, is not guaranteed to be invariant to different but 
equivalent representations of the hidden state process (Jebara et al. 2004)]^] 



Note that spectral clustering algorithms similar to (Jebara et al. 2007[ ) can be applied 
to kernel (similarity) matrices that are based on other affinity scores between HMM distri- 



butions than the PPK similarity of Jebara et al. (2004). Examples can be found in earlier 



work on HMM-based clustering of time-series, such as Juang and Rabiner (1985), Lyngso 



et al. (1999), Bahlmann and Burkhardt (2001), Panuccio et al. (2002). In particular, Juang 



and Rabiner ( 1985 ) propose to approximate the (symmetrised) log-likelihood between two 
HMM distributions by computing the log-likelihood of real samples generated by one model 
under the otherF] Extensions of Juang and Rabiner ( 1985 ) have been proposed by Zhong 
and Ghosh (2003) and Yin and Yang (2005). In this work we do not pursue a comparison 



of the various similarity functions, but implement spectral clustering only based on PPK 



similarity (which Jebara et al. (2007) showed to be superior). 

HMMs can also be clustered by sampling a number of time-series from each of the HMMs 



in the base mixture, and then applying the EM algorithm for H3Ms (Smyth, 1997), to cluster 



the time-series. Despite its simplicity, this approach would suffer from high memory and 
time requirements, especially when dealing with a large number of input HMMs. First, 
all generated samples need to be stored in memory. Second, evaluating the likelihood of 
the generated samples at each iteration is computationally intensive, and prevents the EM 
algorithm from converging to effective estimates in acceptable times J^] On the contrary, 
VHEM-H3M is more efficient in computation and memory usage, as it replaces a costly 
sampling step (along with the associated likelihood computations at each iteration) with an 
expectation. An additional problem of EM with sampling is that, with a simple application 
of the EM algorithm, time-series generated from the same input HMM can be assigned 
to different clusters of the output H3M. As a consequence, the resulting clustering is not 
necessary consistent, since in this case the corresponding input HMM may not be clearly 
assigned to any single cluster. In our experiments, we circumvent this problem by defining 
appropriate constrains on the assignment variables. 



The experimental resu lts in ( Jebara et al 

computed 



The kernel in ( 60 1 is 



bj 



2004 1 and ( Jebara et al. 



out 



20071 suggest to use A < 1. 
hidden 



the Hidden state variables, i.e. 
This can be efficiently solved with 
7^ 1, |Jebara et H] (2004} pro- 



marginalizing 

/ (j2 xl .. T P(y^,Xl:r\M^)y (Y, xl .. T P(yi:T,Xl:,\M^)) X dy 1:T . 

the junction tree algorithm only when A = 1. For A 

pose to use an alternative kernel k that applies the power operation to the terms of the sum 
rather than the entire sum, where the terms are joint probabilities p(yi-. T , xi- T ). I.e., k(a,b) = 

/E» 1!T (piVl-.r^^M^Y E Bl!T (p(Vl:r, Xl: T \M^)) X d Vl:T . 

For two HMM distributions, M {a) and M {h \ Juang and Rabiner (19851 consider the affinity L(a,b) = 
| [log p(Y b \M {a) ) + p{Y a \M lb) )\ , where Y a and Y b are sets of observation sequences generated from M^ a > 
and M^ b \ respectively. 

In our experiments, EM on generated samples took two orders of magnitude more time than VHEM. 
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The VHEM algorithm is similar in spirit to Bregman-clustering in Banerjee et al. (2005 ). 
Both algorithms base clustering on KL-divergence — the KL divergence and the expected 
log-likelihood differ only for an entropy term that does not affect the clustering. The 
main differences are: 1) in our setting, the expected log-likelihood (and KL divergence) is 
not computable in closed form, and hence VHEM uses an approximation; 2) VHEM-H3M 
clusters random processes (i.e., time series models), whereas (Banerjee et al. |2005 ) is limited 
to single random variables. 

In the next two sections, we validate the points raised in this discussion through exper- 
imental evaluation using the VHEM-H3M algorithm. In particular, we consider clustering 
experiments in Section [5j and H3M density estimation for automatic annotation and re- 
trieval in Section [6j Each application exploits some of the benefits of VHEM. First, we 
show that VHEM-H3M is more accurate in clustering than PPK-SC, in particular at higher 
levels of a hierarchical clustering (Section 5.2), and in an experiment with synthetic data 
(Section 5.3). Similarly, the annotation and retrieval results in Section [6] favor VHEM-H3M 
over PPK-SC and over standard EM, suggesting that VHEM-H3M is more robust and ef- 
fective for H3M density estimation. Finally, in all the experiments, the running time of 
VHEM-H3M compares favorably with the other HMM clustering algorithms; PPK-SC suf- 
fers long delays when the number of input HMMs is large and the standard EM algorithm 
is considerably slower. This demonstrates that VHEM-H3M is most efficient for clustering 
HMMs. 



5. Clustering Experiments 

In this section, we present an empirical study of the VHEM-H3M algorithm for clustering 
and hierarchical clustering of HMMs. Clustering HMMs consists in partitioning K\ input 
HMMs into K 2 < K\ groups of similar HMMs. Hierarchical clustering involves organizing 
the input HMMs in a multi-level hierarchy with h levels, by applying clustering in a recursive 
manner. Each level i of the hierarchy has Kg groups (with K\ > K 2 > ■ ■ ■ > K^-i > K^), 
and the first level consists of the K\ input HMMs. 

We begin with an experiment on hierarchical clustering, where each of the input HMMs 



to be clustered is estimated on a sequence of motion capture data (Section 5.2). Then, we 
present a simulation study on clustering synthetic HMMs (Section |5.3| ). First, we provide 
an overview of the different algorithms used in this study. 

5.1 Clustering methods 

In the clustering experiments, we will compare our VHEM-H3M algorithm with several 
other clustering algorithms. The various algorithms are summarized here. 

• VHEM-H3M: We cluster K x input HMMs into K 2 clusters by using the VHEM-H3M 
algorithm (on the input HMMs) to learn a H3M with K 2 components (as explained 



in Section 3.1 ). To build a multi-level hierarchy of HMMs with h levels, we start from 
the first level of K\ input HMMs, and recursively use the VHEM-H3M algorithm 
h— 1 times. Each new level I is formed by clustering the Ki—\ HMMs at the previous 
level into Ki < Kn—\ groups with the VHEM-H3M algorithm, and using the learned 
HMMs as cluster centers at the new level. In our experiments, we set the number of 
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virtual samples to N = 10 4 iT^ -1 ^, a large value that favors "hard" clustering (where 
each HMM is univocally assigned to a single cluster), and the length of the virtual 
sequences to r = 10. 



PPK-SC: Jebara et al. (2007) cluster HMMs by calculating a PPK similarity matrix 



between all HMMs, and then applying spectral clustering. The work in Jebara et al. 



(2007) only considered HMMs with single Gaussian emissions, which did not always 



give satisfactory results in our experiments. Hence, we extended (Jebara et al. 2007) 



by allowing GMM emissions, and derived the PPK similarity for this more general 
case using (Jebara et al. 2004). From preliminary experiments, we found the best 



performance for PPK with A = ^ (i- e -> Bhattacharyya affinity), and when integrating 
over sequences of length r = 10. Finally, we also extend Jebara et al. (2007) to con- 



struct multi- level hierarchies, by using hierarchical k-means in the spectral clustering 
embedding. 

SHEM-H3M: This is a version of HEM-H3M that maximizes the likelihood of actual 



samples generated from the input HMMs, as in (10), rather than the expectation of 
virtual samples, as in |ll). In particular, from each input HMM Aif^ we sample a 
set Yi of Ni = 7r- N observation sequences (for a large value of N). We then estimate 
the reduced H3M from the N samples Y = {Yi}ft}^ , with the EM-H3M algorithm of 



Smyth (1997), which was modified to use a single assignment variable for each sample 



set Yi, to obtain a consistent clustering. 

In many real- life applications, the goal is to cluster a collection of time series, i.e., 
observed sequences. Although the input data is not a collection of HMMs in that case, 
it can still be clustered with the VHEM-H3M algorithm by first modeling each sequence 
as an HMM, and then using the HMMs as input for the VHEM-H3M algorithm. With 
time-series data as input, it is also possible to use clustering approaches that do not model 
each sequence as a HMM. Hence, in one of the hierarchical motion clustering experiments, 
we also compare to the following two algorithms, one that clusters time-series data directly 



(Smyth, 1997), and a second one that clusters the time series after modeling each sequence 



with a dynamic texture (DT) model (Chan et al. 2010b). 



EM-H3M: The EM algorithm for H3Ms (Smyth, 1997) is applied directly on a codec 



tion of time series to learn the clustering and HMM cluster centers, thus bypassing the 
intermediate HMM modeling stage. To obtain a hierarchical clustering (with h > 3 
levels), we proceed in a bottom up fashion and build each new level by simply re- 



clustering the given time series in a smaller number of clusters using (Smyth, 1997). 



We extend the algorithm to use a single assignment variable for each set of sequences 
Yi that are within the same cluster in the immediately lower level of the hierarchy. 
This modification preserves the hierarchical clustering property that sequences in a 
cluster will remain together at the higher levels. 

HEM-DTM: Rather than use HMMs, we consider a clustering model based on linear 



dynamical systems, i.e., dynamic textures (DTs) (Doretto et al. 2003). Hierarchical 



clustering is performed using the hierarchical EM algorithm for DT mixtures (HEM- 



DTM) ( Chan et al. 2010b), in an analogous way to VHEM-H3M. The main difference 
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(a) "Walk" sequence. 




Figure 1: Examples of motion capture sequences from the MoCap dataset, shown with stick 
figures. 



is that, with HEM-DTM, time-series are modeled as DTs, which have a continu- 
ous state space (a Gauss-Markov model) and unimodal observation model, whereas 
VHEM-H3M uses a discrete state space and multimodal observations (GMMs). 



We will use several metrics to quantitatively compare the results of different clustering 



algorithms. First, we will calculate the Rand-index (Hubert and Arabie, 1985), which 



measures the correctness of a proposed clustering against a given ground truth clustering. 
Intuitively, this index measures how consistent cluster assignments are with the ground 
truth (i.e., whether pairs of items are correctly or incorrectly assigned to the same cluster, 
or different clusters). Second, we will consider the log-likelihood, as used by Smyth (1997) 
to evaluate a clustering. This measures how well the clustering fits the input data. When 
time series are given as input data, we compute the log-likelihood of a clustering as the sum 
of the log-likelihoods of each input sequence under the HMM cluster center to which it has 
been assigned. When the input data consists of HMMs, we will evaluate the log-likelihood 
of a clustering by using the expected log-likelihood of observations generated from an input 
HMM under the HMM cluster center to which it is assigned. For PPK-SC, the cluster center 
is estimated by running the VHEM-H3M algorithm (with = 1) on the HMMs assigned 
to the cluster^] Note that the log-likelihood will be particularly appropriate to compare 
VHEM-H3M, SHEM-H3M, EM-H3M and HEM-DTM, since they explicitly optimize for it. 
However, it may be unfair for PPK-SC, since this method optimizes the PPK similarity and 
not the log-likelihood. As a consequence, we also measure the PPK cluster-compactness, 
which is more directly related to what PPK-SC optimizes for. The PPK cluster-compactness 
is the sum (over all clusters) of the average intra-cluster PPK pair-wise similarity. This 
performance metric favors methods that produce clusters with high intra-cluster similarity. 
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Figure 2: An example of hierarchical clustering of the MoCap dataset, with VHEM-H3M 
and PPK-SC. Different colors represent different motion classes. Vertical bars represent 
clusters, with the colors indicating the proportions of the motion classes in a cluster, and 
the numbers on the x-axes representing the clusters' indexes. At Level 1 there are 56 
clusters, one for each motion sequence. At Levels 2, 3 and 4 there are 8, 4 and 2 HMM 
clusters, respectively. For VHEM almost all clusters at Level 2 are populated by examples 
from a single motion class. The error of VHEM in clustering a portion of "soccer" with 
"basket" is probably because both actions involve a sequence of movement, shot, and pause. 
Moving up the hierarchy, the VHEM algorithm clusters similar motions classes together, 
and at Level 4 creates a dichotomy between "sit" and the other (more dynamic) motion 
classes. PPK-SC also clusters motion sequences well at Level 2, but incorrectly aggregates 
"sit" and "soccer", which have quite different dynamics. At Level 4, the clustering obtained 
by PPK-SC is harder to interpret than that by VHEM. 



5.2 Hierarchical motion clustering 

In this experiment we test the VHEM algorithm on hierarchical motion clustering from 
motion capture data, i.e., time series representing human locomotions and actions. To 
hierarchically cluster a collection of time series, we first model each time series with an HMM 
and then cluster the HMMs hierarchically. Since each HMM summarizes the appearance 
and dynamics of the particular motion sequence it represents, the structure encoded in the 



hierarchy of HMMs directly applies to the original motion sequences. Jebara et al. (2007) 
uses a similar approach to cluster motion sequences, applying PPK-SC to cluster HMMs. 
However, they did not extend their study to hierarchies with multiple levels. 



6. Alternatively, we could use as cluster center the HMM mapped the closest to the spectral embedding 
cluster center, but this always resulted in lower log-likelihood. 
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5.2.1 Datasets and setup 



We experiment on two motion capture datasets, the MoCap dataset (http://mocap.cs 



cmu.edu/) and the Vicon Physical Action dataset (Theodoridis and Hu, 2007; Frank and 



Asuncion, 2010). For the MoCap dataset, we use 56 motion examples spanning 8 different 
classes ("jump", "run", "jog", "walk 1", "walk 2", "basket", "soccer", and "sit"). Each 
example is a sequence of 123-dimensional vectors representing the (x, y, z)-coordinates of 41 
body markers tracked spatially through time. Figure [T] illustrates some typical examples. 
We built a hierarchy of h = 4 levels. The first level (Level 1) was formed by the K\ = 56 
HMMs learned from each individual motion example (with S = 4 hidden states, and M = 2 
components for each GMM emission). The next three levels contain K2 = 8, K3 = 4 and 
K 4 = 2 HMMs. We perform the hierarchical clustering with VHEM-H3M, PPK-SC, EM- 
H3M, SHEM-H3M (JV € {560,2800} and r = 10), and HEM-DTM (state dimension of 7). 
The experiments were repeated 10 times for each clustering method, using different random 
initializations of the algorithms. 

The Vicon Physical Action dataset is a collection of 200 motion sequences. Each se- 
quence consists of a time series of 27-dimensional vectors representing the (x, y, z)-coordinates 
of 9 body markers captured using the Vicon 3D tracker. The dataset includes 10 normal 
and 10 aggressive activities, performed by each of 10 human subjects a single time. We 
build a hierarchy of h = 5 levels, starting with K\ = 200 HMMs (with 5 = 4 hidden states 
and M = 2 components for each GMM emission) at the first level (i.e., one for each motion 
sequence), and using K<i = 20, K3 = 8, K4 = 4, and K§ = 2 for the next four levels. The 
experiment was repeated 5 times with VHEM-H3M and PPK-SC, using different random 
initializations of the algorithms. 

In similar experiments where we varied the number of levels h of the hierarchy and 
the number of clusters at each level, we noted similar relative performances of the various 
clustering algorithms, on both datasets. 



5.2.2 Results on the MoCap dataset 

An example of hierarchical clustering of the MoCap dataset with VHEM-H3M is illustrated 
in Figure [2] (left). In the first level, each vertical bar represents a motion sequence, with 
different colors indicating different ground-truth classes. In the second level, the K2 = 8 
HMM clusters are shown with vertical bars, with the colors indicating the proportions of 
the motion classes in the cluster. Almost all clusters are populated by examples from a 
single motion class (e.g., "run", "jog", "jump"), which demonstrates that VHEM can group 
similar motions together. We note an error of VHEM in clustering a portion of the "soccer" 
examples with "basket" . This is probably caused by the similar dynamics of these actions, 
which both consist of a sequence of movement, shot, and pause. Moving up the hierarchy, 
the VHEM algorithm clusters similar motion classes together (as indicated by the arrows), 
for example "walk 1" and "walk 2" are clustered together at Level 2, and at the highest 
level (Level 4) it creates a dichotomy between "sit" and the rest of the motion classes. This 
is a desirable behavior as the kinetics of the "sit" sequences (which in the MoCap dataset 
correspond to starting in a standing position, sitting on a stool, and returning to a standing 
position) are considerably different from the rest. On the right of Figure [2j the same 
experiment is repeated with PPK-SC. PPK-SC clusters motion sequences properly, but 
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Table 2: Hierarchical clustering of the MoCap dataset using VHEM-H3M, PPK-SC, SHEM- 
H3M, EM-H3M and HEM-DTM. The number in brackets after SHEM-H3M represents the 
number of real samples used. We computed Rand-index, data log-likelihood and cluster 
compactness at each level of the hierarchy, and registered the time (in seconds) to learn 
the hierarchical structure. Differences in Rand-index at Levels 2, 3, and 4 are statistically 
significant based on a paired t-test with confidence 95%. 





Rand-index 


log-likelihood (xlO 6 ) 


PPK cluster-compactness 


time (s) 


Level 


2 


3 


4 


2 


3 


4 


2 


3 


4 




VHEM-H3M 


0.937 


0.811 


0.518 


-5.361 


-5.682 


-5.866 


0.0075 


0.0068 


0.0061 


30.97 


PPK-SC 


0.956 


0.740 


0.393 


-5.399 


-5.845 


-6.068 


0.0082 


0.0021 


0.0008 


37.69 


SHEM-H3M (560) 


0.714 


0.359 


0.234 


-13.632 


-69.746 


-275.650 


0.0062 


0.0034 


0.0031 


843.89 


SHEM-H3M (2800) 


0.782 


0.685 


0.480 


-14.645 


-30.086 


-52.227 


0.0050 


0.0036 


0.0030 


3849.72 


EM-H3M 


0.831 


0.430 


0.340 


-5.713 


-202.55 


-168.90 


0.0099 


0.0060 


0.0056 


667.97 


HEM-DTM 


0.897 


0.661 


0.412 


-7.125 


-8.163 


-8.532 








121.32 



incorrectly aggregates "sit" and "soccer" at Level 2, even though they have quite different 
dynamics. Furthermore, the highest level (Level 4) of the hierarchical clustering produced 
by PPK-SC is harder to interpret than that of VHEM. 

Table [2] presents a quantitative comparison between PPK-SC and VHEM-H3M at each 
level of the hierarchy. While VHEM-H3M has lower Rand-index than PPK-SC at Level 
2 (0.937 vs. 0.956), VHEM-H3M has higher Rand-index at Level 3 (0.811 vs. 0.740) and 
Level 4 (0.518 vs. 0.393). In terms of PPK cluster-compactness, we observe similar results. 
In particular, VHEM-H3M has higher PPK cluster-compactness than PPK-SC at Level 3 
and 4. Overall, keeping in mind that PPK-SC is explicitly driven by PPK-similarity, while 
the VHEM-H3M algorithm is not, these results can be considered as strongly in favor of 
VHEM-H3M (over PPK-SC). In addition, the data log-likelihood for VHEM-H3M is higher 
than that for PPK-SC at each level of the hierarchy. This suggests that the novel HMM 
cluster centers learned by VHEM-H3M fit the motion capture data better than the spectral 
cluster centers. This conclusion is further supported by the results of the density estimation 



experiments in Sections 6.1 and 6.2 Note that the higher up in the hierarchy, the more 
clearly this effect is manifested. 

Comparing to other methods (also in Table [2]), EM-H3M generally has lower Rand-index 
than VHEM-H3M and PPK-SC (consistent with the results in ( |Jebara et aL]|2007[ )). While 
EM-H3M directly clusters the original motion sequences, both VHEM-H3M and PPK-SC 
implicitly integrate over all possible virtual variations of the original motion sequences 
(according to the intermediate HMM models), which results in more robust clustering pro- 
cedures. In addition, EM-H3M has considerably longer running times than VHEM-H3M 
and PPK-SC (i.e., roughly 20 times longer) since it needs to evaluate the likelihood of all 
training sequences at each iteration, at all levels. 

The results in Table [2] favor VHEM-H3M over SHEM-H3M, and empirically validate 
the variational approximation that VHEM uses for learning. For example, when using 
N = 2800 samples, running SHEM-H3M takes over two orders of magnitude more time 
than VHEM-H3M, but still does not achieve performance competitive with VHEM-H3M. 
With an efficient closed-form expression for averaging over all possible virtual samples, 
VHEM approximates the sufficient statistics of a virtually unlimited number of observation 
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Table 3: Hierarchical clustering of the Vicon Physical Action dataset using VHEM-H3M 
and PPK-SC. Performance is measured in terms of Rand-index, data log-likelihood and 
PPK cluster-compactness at each level. Differences in Rand-index at Levels 2, 4 and 5 are 
statistically significant based on a paired t-test with confidence 95%. The test failed at 
Level 3. 



Level 


2 


Rand-index 
3 4 


5 


log-likelihood (xlO 6 
2 3 4 


) 

5 


PPK cluster-compactness 
2 3 4 5 


VHEM-H3M 
PPK-SC 


0.909 
0.900 


0.805 0.610 
0.807 0.452 


0.244 
0.092 


-1.494 -3.820 -5.087 
-3.857 -5.594 -6.163 


-6.172 
-6.643 


0.224 0.059 0.020 0.005 
0.324 0.081 0.026 0.008 



sequences, without the need of using real samples. This has an additional regularization 
effect that improves the robustness of the learned HMM cluster centers. In contrast, SHEM- 
H3M uses real samples, and requires a large number of them to learn accurate models, which 
results in significantly longer running times. 

Finally, in Table [2j we also report hierarchical clustering performance for HEM-DTM. 
VHEM-H3M consistently outperforms HEM-DTM, both in terms of Rand-index and data 
log-likelihoodQ Since both VHEM-H3M and HEM-DTM are based on the hierarchical EM 
algorithm for learning the clustering, this indicates that HMM-based clustering models are 
more appropriate than DT-based models for the human MoCap data. Note that, while 
PPK-SC is also HMM-based, it has a lower Rand-index than HEM-DTM at Level 4. This 
further suggests that PPK-SC does not optimally cluster the HMMs. 

5.2.3 Results on the Vicon Physical Action dataset 

Table [3] presents results using VHEM-H3M and PPK-SC to cluster the Vicon Physical 
Action dataset. While the two algorithms performs similarly in terms of Rand-index at 
lower levels of the hierarchy (i.e., Level 2 and Level 3), at higher levels (i.e., Level 4 and Level 
5) VHEM-H3M outperforms PPK-SC. In addition, VHEM-H3M registers higher data log- 
likelihood than PPK-SC at each level of the hierarchy. This, again, suggests that by learning 
new cluster centers, the VHEM-H3M algorithm retains more information on the clusters' 
structure than PPK-SC. Finally, compared to VHEM-H3M, PPK-SC produces clusters that 
are more compact in terms of PPK similarity. However, this does not necessarily imply a 
better agreement with the ground truth clustering, as evinced by the Rand-index metrics. 

5.3 Clustering synthetic data 

In this experiment, we compare VHEM-H3M and PPK-SC on clustering a synthetic dataset 
of HMMs. 

5.3.1 Dataset and setup 

The synthetic dataset of HMMs is generated as follows. Given a set of C HMMs {M^}^ =1 , 
for each HMM we synthesize a set of K "noisy" versions of the original HMM. Each "noisy" 

7. We did not report PPK cluster-compactness for HEM-DTM, since it would not be directly comparable 
with the same metric based on HMMs. 
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HMM .Ml (k = 1, . . . K) is synthesized by generating a random sequence y%-T of length T 
from Ai( c \ corrupting it with Gaussian noise ~ AA(0, cr^Lj), and estimating the parameters 

~ (c) 

of A4 k on the corrupted version of yi-.T- Note that this procedure adds noise in the 
observation space. The number of noisy versions (of each given HMM), K, and the noise 
variance, u^, will be varied during the experiments. 

The collection of original HMMs was created as follows. Their number was always set 
to C = 4, the number of hidden states of the HMMs to S = 3, the emission distributions to 
be single, one- dimensional Gaussians (i.e., GMMs with M = 1 component), and the length 
of the sequences to T = 100. For all original HMMs Ai^ c ' , the initial state probability and 
state transition matrix were fixed as 
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We consider two settings of the emission distributions. In the first setting, experiment (a), 
the HMMs only differ in the means of the emission distributions, 
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In the second setting, experiment (b), the HMMs differ in the variances of the emission 
distributions, 
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The VHEM-H3M and PPK-SC algorithms are used to cluster the synthesized HMMs, 
{{A4^}fc =1 }c =1 , into C groups, and the quality of the resulting clusterings is measured 
with the Rand-index, PPK cluster-compactness, and the expected log-likelihood of the 
discovered cluster centers with respect to the original HMMs. The expected log-likelihood 



was computed using the lower bound, as in (19), with each of the original HMMs assigned 



to the most likely HMM cluster center. The results are averages over 10 trials. 



5.3.2 Results 

Figure [3] reports the performance metrics when varying the number K S {2,4,8, 16,32} of 
noisy versions of each of the original HMMs, and the noise variance <r^ £ {0.1,0.5, 1}, for 
the two experimental settings. For the majority of settings of K and a^, the clustering 
produced by VHEM-H3M is superior to the one produced by PPK-SC, for each of the 
considered metrics (i.e., in the plots, solid lines are usually above dashed lines of the same 
color). The only exception is in experiment (b) where, for low noise variance (i.e., o\ = 0.1) 
PPK-SC is the best in terms of Rand-index and cluster compactness. It is interesting to 
note that the gap in performance between VHEM-H3M and PPK-SC is generally larger 
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Experiment (a): different emission means. 
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Experiment (b): different emission variances. 
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Figure 3: Results on clustering synthetic data with VHEM-H3M and PPK-SC. Performance 
is measured in terms of Rand-index, expected log-likelihood and PPK cluster-compactness. 



at low values of K. We believe this is because, when only a limited number of input 
HMMs is available, PPK-SC produces an embedding of lower quality. This does not affect 
VHEM-H3M, since it clusters in HMM distribution space and does not use an embedding. 

These results suggest that, by clustering HMMs directly in distribution space, VHEM- 
H3M is generally more robust than PPK-SC, the performance of which instead depends on 
the quality of the underlying embedding. 



6. Density estimation experiments 

In this section, we present an empirical study of VHEM-H3M for density estimation, in 
automatic annotation and retrieval of music (Section 6.1) and hand-written digits (Section 
6721). 



6.1 Music annotation and retrieval 

In this experiment, we evaluate VHEM-H3M for estimating annotation models in content- 
based music auto-tagging. As a generative time-series model, H3Ms allow to account for 
timbre (i.e., through the GMM emission process) as well as longer term temporal dynamics 
(i.e., through the HMM hidden state process), when modeling musical signals. Therefore, 
in music annotation and retrieval applications, H3Ms are expected to prove more effective 
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than existing models that do not explicitly account for temporal information (Turnbull 



et al. 2008 Mandel and Ellis 2008; Eck et al. 2008 Hoffman et al. 2009) 



6.1.1 Music dataset 



We consider the CAL500 collection from (Turnbull et al. , 2008), which consists of 502 songs 



and provides binary annotations with respect to a vocabulary V of 149 tags, ranging from 
genre and instrumentation, to mood and usage. To represent the acoustic content of a song 
we extract a time series of audio features y = {y±, . . . , y\y\}, b y computing the first 13 Mel 
frequency cepstral coefficients (MFCCs) (R abiner and Juang| 1993) over half-overlapping 
windows of 46ms of audio signal, augmented with first and second instantaneous derivatives. 
The song is then represented as a collection of audio fragments, which are sequences of 
T = 125 audio features (approximately 6 seconds of audio), using a dense sampling with 
Vo overlap. 



6.1.2 Music annotation models 



Automatic music tagging is formulated as a supervised multi-label problem ( Carneiro et al 



2007), where each class is a tag from V. We approach this problem by modeling the audio 



content for each tag with a H3M probability distribution. I.e., for each tag, we estimate 
an H3M over the audio fragments of the songs in the database that have been associated 
with that tag, using the hierarchical estimation procedure based on VHEM-H3M. More 
specifically, the database is first processed at the song level, using the EM algorithm to 
learn a H3M with = 6 components for each song^J from its audio fragments. Then, 
for each tag, the song-level H3Ms labeled with that tag are pooled together to form a large 
H3M, and the VHEM-H3M algorithm is used to reduce this to a final H3M tag-model with 
K = 3 components (r = 10 and N = N v NtK^ s \ where N v = 1000 and Nt is the number of 
training songs for the particular tag). 

Given the tag-level models, a song can be represented as a vector of posterior proba- 
bilities of each tag (a semantic multinomial, SMN), by extracting features from the song, 
computing the likelihood of the features under each tag-level model, and applying Bayes' 
rule. A test song is annotated with the top-ranking tags in its SMN. To retrieve songs given 
a tag query, a collection of songs is ranked by the tag's probability in their SMNs. 

We compare VHEM-H3M with three alternative algorithms for estimating the H3M tag 
models: PPK-SC, PPK-SC-hybrid, and EM-H3M0 For all three alternatives, we use the 
same number of mixture components in the tag models (K = 3). For the two PPK-SC 



methods, we leverage the work of Jebara et al. (2007) to learn H3M tag models, and use 



it in place of the VHEM-H3M algorithm in the second stage of the hierarchical estimation 
procedure. We found that it was necessary to implement the PPK-SC approaches with song- 



8. Most pop songs have 6 structural parts: intro, verse, chorus, solo, bridge and outro. 

9. For this experiment, we were not able to successfully estimate accurate H3M tag models with SHEM- 
H3M. In particular, SHEM-H3M requires generating an appropriately large number of real samples to 
produce accurate estimates. However, due to the computational limits of our cluster, we were able 
to test SHEM-H3M only using a small number of samples. In preliminary experiments we registered 
performance only slightly above chance level and training times still twice longer than for VHEM-H3M. 
For a comparison between VHEM-H3M and SHEM-H3M on density estimation, the reader can refer to 



the experiment in Section 6.2 on online hand-writing classification and retrieval. 
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level H3Ms with only = 1 component (i.e., a single HMM), since the computational 
cost for constructing the initial embedding scales poorly with the number of input HMMs 10 
PPK-SC first applies spectral clustering to the song-level HMMs and then selects as the 
cluster centers the HMMs that map closest to the spectral cluster centers in the spectral 
embedding. PPK-SC-hybrid is a hybrid method combining PPK-SC for clustering, and 
VHEM-H3M for estimating the cluster centers. Specifically, after spectral clustering, HMM 
cluster centers are estimated by applying VHEM-H3M (with K^ r ' = 1) to the HMMs 
assigned to each of the resulting clusters. In other words, PPK-SC and PPK-SC-hybrid use 
spectral clustering to summarize a collection of song-level HMMs with a few HMM centers, 
forming a H3M tag model. The mixture weight of each HMM component (in the H3M tag 
model) is set proportional to the number of HMMs assigned to that cluster. 

For EM-H3M, the H3M tag models were estimated directly from the audio fragments 
from the relevant songs using the EM-H3M algorithm j^] Empirically, we found that, due 
to its runtime and RAM requirements, for EM-H3M we must use non-overlapping audio- 
fragments and evenly subsample by 73% on average, resulting in 14.5% of the sequences 
used by VHEM-H3M. Note that, however, EM-H3M is still using 73% of the actual song 
data (just with non-overlapping sequences). We believe this to be a reasonable comparison 
between EM and VHEM, as both methods use roughly similar resources (the sub-sampled 
EM is still 3 times slower, as reported in Table [4]). Based on our projections, running EM 
over densely sampled song data would require roughly 9000 hours of CPU time (e.g., more 
than 5 weeks when parallelizing the algorithm over 10 processors), as opposed to 630 hours 
for VHEM-H3M. This would be extremely cpu-intensive given the computational limits of 
our cluster. The VHEM algorithm, on the other hand, can learn from considerable amounts 
of data while still maintaining low runtime and memory requirements P| 

Finally, we also compare against two state-of-the-art models for music tagging, HEM- 
DTM ( |Coviello et al. 2011), which is based on a different time-series model (mixture of 
dynamic textures), and HEM-GMM ( |Turnbull et al. 2008), which is a bag-of- features model 
using GMMs. Both methods use efficient hierarchical estimation based on a HEM algorithm 



(Chan et al. 2010b Vasconcelos and Lippman 1998) to obtain the tag-level models 



13 



6.1.3 Performance metrics 

A test song is annotated with the 10 most likely tags, and annotation performance is 
measured with the per-tag precision (P), recall (R), and F-score (F), averaged over all tags. 



10. 



Running PPK-SC with K = 2 took 3958 hours in total (about 4 times more than when setting 
j^W — with no improvement in annotation and retrieval performance. A larger K^ 3 ^ 1 would yield 
impractically long learning times. 



12. 



13. 



11. The EM algorithm has been used to estimate HMMs from music data in previous work, e.g., (Scaringella 



and Zoia 


2005 


Reed and Lee 


2006 



fragments. Using the hierarchical estimation procedure, we first model each song (in average, 15MB of 
audio fragments) individually as a song-level H3M, and we save the song models (150 KB of memory 
each). Then, we pool the 200 song models into a large H3M (in total 30MB of memory), and reduce it 
to a smaller tag-level H3M using the VHEM-H3M algorithm. 

Both auto-taggers operate on audio features extracted over half-overlapping windows of 46ms. HEM- 
GMM uses MFCCs with first and second instantaneous derivatives dTurnbull et al. 1 120081). HEM-DTM 



uses 34-bins of Mel-spectral features ( Coviello et al. 2011 1, which are further grouped in audio fragments 
of 125 consecutive features. 
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Table 4: Annotation and retrieval performance on CAL500, for VHEM-H3M, PPK-SC, 



PPK-SC- hybrid, EM-H3M, HEM-DTM (Coviello et al. 2011) and HEM-GMM (Turnbull 



et al., 2008). 



annotation retrieval 





P 


R 


F 


MAP 


P@5 


P@10 


P@15 


time (h) 


VHEM-H3M 


0.446 


0.211 


0.260 


0.440 


0.474 


0.451 


0.435 


629.5 


PPK-SC 


0.299 


0.159 


0.151 


0.347 


0.358 


0.340 


0.329 


974.0 


PPK-SC-hybrid 


0.407 


0.200 


0.221 


0.415 


0.439 


0.421 


0.407 


991.7 


EM-H3M 


0.415 


0.214 


0.248 


0.423 


0.440 


0.422 


0.407 


1860.4 


HEM-DTM 


0.431 


0.202 


0.252 


0.439 


0.479 


0.454 


0.428 




HEM-GMM 


0.374 


0.205 


0.213 


0.417 


0.441 


0.425 


0.416 





If \wh\ is the number of test songs that have the tag w in their ground truth annotations, 
\wa\ is the number of times an annotation system uses w when automatically tagging a 
song, and \wc\ is the number of times w is correctly used, then precision, recall and F-score 
for the tag w are defined as: 

P = 1^1 R = F = 2 ((p) -i + (R) -i)-i . (64) 

\WA\ \W H \ 

Retrieval is measured by computing per-tag mean average precision (MAP) and precision 
at the first k retrieved songs (P@k), for k £ {5, 10, 15}. The P@k is the fraction of true 
positives in the top-k of the ranking. MAP averages the precision at each point in the 
ranking where a song is correctly retrieved. All reported metrics are averages over the 
97 tags that have at least 30 examples in CAL500 (11 genre, 14 instrument, 25 acoustic 
quality, 6 vocal characteristics, 35 emotion and 6 usage tags), and are the result of 5-fold 
cross-validation. 

Finally, we also record the total time (in hours) to learn the 97 tag-level H3Ms on the 
5 splits of the data. For hierarchical estimation methods (VHEM-H3M and the PPK-SC 
approaches), this also includes the time to learn the song- level H3Ms. 

6.1.4 Results 

In Table [4] we report the performance of the various algorithms for both annotation and 
retrieval on the CAL500 dataset. Looking at the overall runtime, VHEM-H3M is the most 
efficient algorithm for estimating H3M distributions from music data, as it requires only 
34% of the runtime of EM-H3M, and 65% of the runtime of PPK-SC. The VHEM-H3M 
algorithm capitalizes on the first stage of song-level H3M estimation (about one third of 
the total time) by efficiently and effectively using the song-level H3Ms to learn the final tag 
models. Note that the runtime of PPK-SC corresponds to setting 2£w = 1. When we set 
= 2, we registered a running time four times longer, with no significant improvement 
in performance. 

The gain in computational efficiency does not negatively affect the quality of the cor- 
responding models. On the contrary, VHEM-H3M achieves better performance than EM- 
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H3M 14 strongly improving the top of the ranked lists, as evinced by the higher P@k scores. 
Relative to EM-H3M, VHEM-H3M has the benefit of regularization, and during learning 
can efficiently leverage all the music data condensed in the song H3Ms. VHEM-H3M also 
outperforms both PPK-SC approaches on all metrics. PPK-SC discards considerable in- 
formation on the clusters' structure by selecting one of the original HMMs to approximate 
each cluster. This significantly affects the accuracy of the resulting annotation models. 
VHEM-H3M, on the other hand, generates novel HMM cluster centers to summarize the 
clusters. This allows to retain more accurate information in the final annotation models. 

PPK-SC-hybrid achieves considerable improvements relative to standard PPK-SC, at 
relatively low additional computational costs j^] This further demonstrates that the VHEM- 
H3M algorithm can effectively summarize in a smaller model the information contained in 
several HMMs. In addition, we observe that VHEM-H3M still outperforms PPK-SC-hybrid, 
suggesting that the former produces more accurate cluster centers and density estimates. 
In fact, VHEM-H3M couples clustering and learning HMM cluster centers, and is entirely 
based on maximum likelihood for estimating the H3M annotation models. PPK-SC-hybrid, 
on the contrary, separates clustering and parameter estimation, and optimizes them against 
two different metrics (i.e., PPK similarity and expected log-likelihood, respectively). As a 
consequence, the two phases may be mismatched, and the centers learned with VHEM may 
not be the best representatives of the clusters according to PPK affinity. 

Finally, VHEM-H3M compares favorably to the auto-taggers based on other generative 
models. First, VHEM-H3M outperforms HEM-GMM, which does not model temporal 
information in the audio signal, on all metrics. Second, the performances of VHEM-H3M 
and HEM-DTM (a continuous-state temporal model) are not statistically different based 
on a paired t-test with 95% confidence, except for annotation precision where VHEM- 
H3M scores significantly higher. Since HEM-DTM is based on linear dynamic systems 
(a continuous-state model), it can model stationary time-series in a linear subspace. In 
contrast, VHEM-H3M uses HMMs with discrete states and GMM emissions, and can hence 
better adapt to non-stationary time-series on a non-linear manifold. This difference is 
illustrated in the experiments: VHEM-H3M outperforms HEM-DTM on the human MoCap 
data (see Table [2]), which has non-linear dynamics, while the two perform similarly on the 
music data (see Table [2]), where audio features are often stationary over short time frames. 

6.2 On-line hand-writing data classification and retrieval 

In this experiment, we investigate the performance of the VHEM-H3M algorithm in es- 
timating class-conditional H3M distributions for automatic classification and retrieval of 
on-line hand-writing data. 

6.2.1 Dataset 



We consider the Character Trajectories Data Set (Frank and Asuncion, 2010), which is a 



collection of 2858 examples of characters from the same writer, originally compiled to study 



14. The differences in performance are statistically significant based on a paired t-test with 95% confidence. 

15. In PPK-SC-hybrid, each run of the VHEM-H3M algorithm converges quickly since there is only one 
HMM component to be learned, and can benefit from clever initialization (i.e., to the HMM mapped the 
closest to the spectral clustering center). 
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handwriting motion primitives (Williams et al. 2006). Each example is the trajectory of 



one of the 20 different characters that correspond to a single pen-down segment. The data 
was captured from a WACOM tablet at 200 Hz, and consists of (x, y)-coordinates and pen 



tip force. The data has been numerically differentiated and Gaussian smoothed (Williams 



et al. 2006). Half of the data is used for training, with the other half held out for testing. 



6.2.2 Classification models and setup 

From the hand- writing examples in the training set, we estimate a series of class-conditional 
H3M distributions, one for each character, using hierarchical estimation with the VHEM- 
H3M algorithm. First, for each character, we partition all the relevant training data into 
groups of 3 sequences, and learn a HMM (with 5 = 4 states and M = 1 component for 
the GMM emissions) from each group using the Baum- Welch algorithm. Next, we estimate 
the class-conditional distribution (classification model) for each character by aggregating all 
the relevant HMMs and summarizing them into a H3M with K = 4 components using the 



VHEM-H3M algorithm (r = 10 and N = N V N C , where N v = lOj^jand N c is the number of 
intermediate models for that character). Using the character- level H3Ms and Bayes' rule, 
for each hand-writing example in the test set we compute the posterior probabilities of 
all of the 20 characters. Each example is classified as the character with largest posterior 
probability. For retrieval given a query character, examples in the test set are ranked by 
the character's posterior probability. 

We repeated the same experiment using PPK-SC or SHEM-H3M (r = 10, N = 1000) to 
estimate the classification models from the intermediate HMMs. Finally, we considered the 
EM-H3M algorithm, which directly uses the training sequences to learn the class-conditional 
H3M (K = 4). 

Since VHEM-H3M, SHEM-H3M and EM-H3M are iterative algorithms, we studied them 
when varying the stopping criterion. In particular, the algorithms were terminated when the 
relative variation in the value of the objective function between two consecutive iterations 
was lower than a threshold ALL, which we varied in {10~ 2 , 10" 3 , 10~ 4 , 1O~ 5 }0 

Finally, we measure classification and retrieval performance on the test set using the 
classification accuracy, and the average per-tag P@3 and P@5. We also report the total 
training time, which includes the time used to learn the intermediate HMMs. The experi- 
ments consisted of 5 trials with different random initializations of the algorithms. 

6.2.3 Results 

Table [5] lists the classification and retrieval performance on the test set for the various 
methods. Consistent with the experiments on music annotation and retrieval (Section |6.1[), 



16. Note that choosing a lower value of N v (compared to the music experiments) plays a role in making 
the clustering algorithm more reliable. Using fewer virtual samples equates to attaching smaller "virtual 
probability masses" to the input HMMs, and leads to less certain assignments of the input HMMs to the 



clusters (cf. equation 441. This determines more mixing in the initial iterations of the algorithm (e.g., 
similar to higher annealing temperature), and reduces the risk of prematurely specializing any cluster to 
one of the original HMMs. This effect is desirable, since the input HMMs are estimated over a smaller 
number of sequences (compared to the music experiments) and can therefore be noisier and less reliable. 
17. In a similar experiment where we used the number of iterations as the stopping criterion, we registered 
similar results. 
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Table 5: Classification and retrieval performance, and training time on the Character Tra- 
jectories Data Set, for VHEM-H3M, PPK-SC, SHEM-H3M, and EM-H3M. 



stop retrieval classification 





ALL 


P@3 


P@5 


Accuracy 


total time (s) 




i n~ 
±u 


-2 


0.750 


0.750 


0.618 


1838.34 




10" 


-3 


n 7i 7 


n 7^0 


n fii q 


1 Qfi7 ^ 


VHEM-H3M 


10" 


-1 


0.733 


0.790 


0.648 


2210.77 




10" 


-5 


0.750 


0.820 


0.651 


2310.93 




10" 


-2 


0.417 


0.440 


0.530 


13089.32 




10" 


-3 


0.683 


0.680 


0.664 


23203.44 


SHEM-H3M 


10" 


-1 


0.700 


0.750 


0.689 


35752.20 




10" 


-5 


0.700 


0.750 


0.690 


50094.36 




10" 


-2 


0.583 


0.610 


0.646 


6118.53 




10" 


-3 


0.617 


0.650 


0.674 


7318.56 


EM-H3M 


10" 


-4 


0.650 


0.710 


0.707 


9655.08 




10" 


-5 


0.517 


0.560 


0.635 


10957.38 


PPK-SC 






0.600 


0.700 


0.646 


1463.54 



VHEM-H3M performs better than PPK-SC on all metrics. By learning novel HMM cluster 
centers, VHEM-H3M estimates H3M distributions that are representative of all the relevant 
intermediate HMMs, and hence of all the relevant training sequences. While EM-H3M is the 
best in classification (at the price of longer training times), VHEM-H3M performs better in 
retrieval, as evinced by the P@3 and P@5 scores. In terms of training time, VHEM-H3M 
and PPK-SC are about 5 times faster than EM-H3M. In particular, PPK-SC is the fastest 
algorithm, since the small number of input HMMs (i.e., on average 23 per character) allows 
to build the spectral clustering embedding efficiently. 

The version of HEM based on actual sampling (SHEM-H3M) performs better than 
VHEM-H3M in classification, but VHEM-H3M has higher retrieval scores. However, the 
training time for SHEM-H3M is approximately 15 times longer than for VHEM-H3M. In or- 
der to reliably estimate the reduced models, the SHEM-H3M algorithm requires generating 
a large number of samples, and computing their likelihood at each iteration. In contrast, 
the VHEM-H3M algorithm efficiently approximates the sufficient statistics of a virtually 
unlimited number of samples, without the need of using real samples. 

It is also interesting to note that EM-H3M appears to suffer from overfitting of the 
training set, as suggested by the overall drop in performance when the stopping criterion 
changes from ALL = 10~ 4 to ALL = 10" 5 . In contrast, both VHEM-H3M and SHEM- 
H3M consistently improve on all metrics as the algorithm converges (again looking at ALL £ 
{10~ 4 , 10~ 5 }). These results suggest that the regularization effect of hierarchical estimation, 
which is based on averaging over more samples (either virtual or actual), can positively 
impact the generalization of the learned models. 18 



18. For smaller values of ALL (e.g., ALL < 10 5 ), the performance of EM-H3M did not improve. 
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Table 6: Annotation and retrieval performance on CAL500 for VHEM-H3M when varying 
the virtual sample parameters N v and r. 









N v = 


1000 






T 


= 10 








T = 2 


T = 5 


T = 10 


r = 20 


N v = 1 


N v = 10 


N v = 100 


N v = 1000 


P( 


35 


0.4656 


0.4718 


0.4738 


0.4734 


0.4775 


0.4689 


0.4689 


0.4738 


PC 


ilO 


0.4437 


0.4487 


0.4507 


0.4478 


0.4534 


0.4491 


0.4466 


0.4507 


PC 


il5 


0.4236 


0.4309 


0.4346 


0.4327 


0.4313 


0.4307 


0.4242 


0.4346 



Finally, we elaborate on how these results compare to the experiments on music annota- 



tion and retrieval (in Section 6.1). First, in the Character Trajectory Data Set the number 
of training sequences associated with each class (i.e, each character) is small compared to 
the CAL500 dataset As a result, the EM-H3M algorithm is able to process all the data, 
and achieve good classification performance. However, EM-H3M still needs to evaluate the 
likelihood of all the original sequences at each iteration. This leads to slower iterations, 
and results in a total training time about 5 times longer than that of VHEM-H3M (see 
Table [5]). Second, the Character Trajectory data is more "controlled" than the CAL500 
data, since each class corresponds to a single character, and all the examples are from the 
same writer. As a consequence, there is less variation in the intermediate HMMs (i.e., they 
are clustered more closely) , and several of them may summarize the cluster well, providing 
good candidate cluster centers for PPK-SC. In conclusion, PPK-SC faces only a limited 
loss of information when selecting one of the initial HMMs to represent each cluster, and 
achieves reasonable performances. 



6.3 Robustness of VHEM-H3M to number and length of virtual samples 

The generation of virtual samples in VHEM-H3M is controlled by two parameters: the 
number of virtual sequences (N), and their length (r). In this section, we investigate the 
impact of these parameters on annotation and retrieval performance on CAL500. For a 
given tag t, we set N = N v N t K<-'\ where N v is a constant, Nt the number of training songs 
for the tag, and the number of mixture components for each song-level H3M. Starting 
with (N v ,t) = (1000, 10), each parameter is varied while keeping the other one fixed, and 
annotation and retrieval performance on the CAL500 dataset are calculated, as described 
in Section l6~Tl 

Table [6] presents the results, for r £ {2,5,10,20} and N v G {1,10,100,1000}. The 
performances when varying r are close on all metrics. For example, average P@5, P@10 
and P@15 vary in small ranges (0.0082, 0.0070 and 0.0110, respectively). Similarly, varying 
the number of virtual sequences N v has a limited impact on performance as well 
demonstrates that VHEM-H3M is fairly robust to the choice of these parameters. 



20 



This 



19. 



20. 



In the Character Trajectory dataset there are on average 71 training sequences per character. In CAL500, 
each tag is associated with thousands of training sequences at the song level (e.g., an average of about 
8000 audio fragments per tag). 

Note that the E-step of the VHEM-H3M algorithm averages over all possible observations compatible 
with the input models, also when we choose a low value of N v (e.g., N v — 1). The number of virtual 
samples controls the "virtual mass" of each input HMMs and thus the certainty of cluster assignments. 



35 



Coviello, Chan, Lanckriet 



Finally, we tested VHEM-H3M for music annotation and retrieval on CAL500, using 
virtual sequences of the same length as the audio fragments used at the song level, i.e., 
t = T = 125. Compared to r = 10 (the setting used in earlier experiments), we registered 
an 84% increase in total running time, with no corresponding improvement in performance. 
Thus, in our experimental setting, making the virtual sequences relatively short positively 
impacts the running time, without reducing the quality of the learned models. 



7. Conclusion 

In this paper, we presented a variational HEM (VHEM) algorithm for clustering HMMs 
with respect to their probability distributions, while generating a novel HMM center to 
represent each cluster. Experimental results demonstrate the efficacy of the algorithm on 
various clustering, annotation, and retrieval problems involving time-series data, showing 
improvement over current methods. In particular, using a two-stage, hierarchical estimation 
procedure — learn H3Ms on many smaller subsets of the data, and summarize them in a 
more compact H3M model of the data — the VHEM-H3M algorithm estimates annotation 
models from data more efficiently than standard EM and also improves model robustness 
through better regularization. Specifically, averaging over all possible virtual samples pre- 
vents over-fitting, which can improve the generalization of the learned models. Moreover, 
using relatively short virtual sequences positively impacts the running time of the algorithm, 
without negatively affecting its performance on practical applications. In addition, we have 
noted that the VHEM-H3M algorithm is robust to the choice of the number and length of 
virtual samples. 

In our experiments, we have implemented the first stage of the hierarchical estimation 
procedure by partitioning data in non- overlapping subsets (and learning an intermediate 
H3M on each subset). In particular, partitioning the CAL500 data at the song level had 
a practical advantage. Since individual songs in CAL500 are relevant to several tags, the 
estimation of the song H3Ms can be executed one single time for each song in the database, 
and re-used in the VHEM estimation of all the associated tag models. This has a positive 
impact on computational efficiency. Depending on the particular application, however, a 
slightly different implementation of this first stage (of the hierarchical estimation procedure) 
may be better suited. For example, when estimating a H3M from a very large amount of 
training data, one could use a procedure that does not necessarily cover all data, inspired by 



Kleiner et al. (2011). If n is the size of the training data, first estimate B > 1 intermediate 



H3Ms on as many (possibly overlapping) "little" bootstrap subsamples of the data 21 each 
of size b < n. Then summarize all the intermediate H3Ms into a final H3M using the 
VHEM-H3M algorithm. 

In future work we plan to extend VHEM-H3M to the case where all HMMs share a 
large GMM universal background model for the emission distributions (with each HMM 
state having a different set of weights for the Gaussian components), which is commonly 



used in speech (Huang and Jack, 1989 Bellegarda and Nahamoo, 1990; Rabiner and Juang 



1993). This would allow for faster training (moving the complexity to estimating the noise 



background model) and would require a faster implementation of the inference (e.g., using a 



21. Several techniques have been proposed to bootstrap from sequences of samples, for example ( |Hall et al 



1995) 
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strategy similar to (Coviello et al. , 2012b)). In addition, we plan to derive a HEM algorithm 



for HMMs with discrete emission distributions, and compare its performance to the work 
presented here and to the extension with the large GMM background model. 
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Appendix A. Derivation of the E-step 



The maximization of (42) with respect to ^ (pt\pt-i, fit) and (fti {p\\f3i) is carried out 
independently for each pair and follow (Hershey et al. |2008 ). In particular it uses a 

backward recursion, starting with C l ^ +1 {j3t-, pt) = 0, for t = r, . . . , 2, 



W(jH\pt-l,Pt) 
and terminates with 

0i j (pii/3i) 



rh3 



a pt -, pt exp {4^' ij ' Pt) + Pt)} 



V 5 n (r)j ' exnlr Mt) ' (i ' p) 
/ jp «p t _i,p exp s i~ GMM 



4- £ M 
t L -t+ 



iWt, P )} 



(65) 
(66) 



p=l 



7 rM J exp{4 M ^'^) + £^(/3 1 ,p 1 )} 



£7r«-*log5>«"-exp{4i 



(r),j ( Ai,M,(j,P) , r 

Z <p n p ex P I '-GMM * 1 



p=i 



Hp,p)} 



(67) 
(68) 



where (68) is the maxima of the terms in (42) in Section 3.3.1 



Appendix B. Derivation of the M-step 



The M-steps involves maximizing the lower bound in (33) with respect to MS r \ while 
holding the variational distributions fixed, 



argmax C 
MM i=1 



H3M- 



(69) 
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Substituting (45) and (46) into the objective function of (69), 



(70) 



8=1 



UJ 



('■) 



£ hi { log -j- + N, 22 Tfg; £ <^W|/M 

^ /3l:r Pl:r 



',.7 



]o o ?fe + ^ Ai,M,(j,Pt) 

10 & -?.•_•/ , „ , + 2^ GMM 



(71) 



In the following, we detail the update rules for the parameters of the reduced model MS T \ 



HMMs mixture weights 



Collecting terms in (71) that only depend on the mixture weig hts {u;f }fj x , we have 



£({^}) = EE% lo ^f = E 



(r) 



logo; 



(r) 



(72) 



Given the constraints Y^f=i = 1 an( l l fj_ > — 0> (^2) is maximized using the result in 
Appendix C.l, which yields the update in (55). 



,(0 



Initial state probabilities 



The objective function in (|71|) factorizes for each HMM A4j , and hence the parameters of 



depend on the initial state probabilities {ftp ,J }p =1 , 



each HMM are updated independently. For the j-th HMM, we collect terms in ( 71 ) that 

<p= 



i Pi Pi 



r (r)j 
Pi 



pi % 



»<\Pl) 



E 



E% w i ^(p) 



(73) 
(74) 

(75) 
(76) 



where in the (75) we have used the summary statistic defined in (48). Considering the 



C.l, giving the update formula in (56) 



constraints 5^f=i ftp = 1 anc l > 0, (76) is maximized using the result in Appendix 
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State transition probabilities 



(r) - 

Similarly, for each HMM M - and previous state p, we collect terms in (|71|) that depend 
on the transition probabilities {a^'/}y =1 , 



i Pl-.T Pl:T 



(77) 
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(80) 
(81) 
(82) 



(83) 

(84) 
(85) 

(86) 
(87) 



Considering the constraints $ZS=i c^lj = 1 an d > 0, (87) is maximized using the 



jp/=i p,p' - ™rr ~p,p 
result in Appendix C.l, giving the update in (56). 
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Emission probability density functions 



The cost function (71) factors also for each GMM indexed by (j,p,£). Factoring (71), 
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(96) 

(97) 
(98) 



where in ( 98 ) we use the weighted-sum operator defined in ( 59 ) , which is over all base model 



GMMs {.M?2 }. The GMM mixture weights are subject to the constraints YleLi = 1> 
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Vj, p. Taking the derivative with respect to each parameter and setting it to zerc , gives 



the GMM update equations (57) and (58) 



Appendix C. Useful optimization problems 
Appendix C.l 

The optimization problem 

L L 

Y,Peloga e s.t. J^ = l, a e >0, W (99) 



max 



is optimized by at 



i-lt'—l p t 

Appendix C.2 

The optimization problem 
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^at(Pt -logon) s.t. Y, ae = 1 a ^°> W ( 10 °) 



is optimized by a* t = £ cxp f x f Dfl , - 
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